Abstract

How  do  contagions  spread  in  population  networks?  What  happens  if  the 
networks  change  with  time?  Which  hospitals  should  we  give  vaccines  to,  for 
maximum  effect?  How  to  detect  sources  of  rumors  on  Twitter/Facebook? 
These  questions  and  many  others  such  as  which  group  should  we  market 
to,  for  maximizing  product  penetration,  how  quickly  news  travels  in  online 
media  and  how  the  relative  frequencies  of  competing  tasks  evolve  are  all 
related  to  propagation/ cascade-like  phenomena  on  networks. 

In this thesis, we present novel theory, algorithms and models for propagation
processes on large static and dynamic networks, focusing on:

1. Theory: We tackle several fundamental questions, like determining if
there will be an epidemic, given the underlying networks and virus
propagation models, and predicting who wins when viruses (or memes
or products etc.) compete. We give a unifying answer for the threshold
based on eigenvalues, and prove the surprising 'winner-takes-all' result
and other subtle phase-transitions for competition among viruses.

2. Algorithms: Based on our analysis, we give dramatically better algorithms
for important tasks like effective immunization and reliably detecting
culprits of epidemics. Thanks to our carefully designed algorithms,
we achieve 6x fewer infections on real hospital patient-transfer
graphs while also being significantly faster than other competitors (up to
30,000x).

3. Models: Finally, using our insights, we study numerous datasets to develop
powerful general models for information diffusion and competing
species in a variety of situations. Our models unify earlier patterns
and results, yet are succinct and enable challenging tasks like trend
forecasting, spotting outliers and answering 'what-if' questions.

Our inter-disciplinary approach has led to many discoveries in this thesis,
with broad applications spanning areas like public health, social media,
product marketing and networking. We are arguably the first to present a
systematic study of propagation and immunization of single as well as multiple
viruses on arbitrary, real and time-varying networks, as the vast majority of the
literature focuses on structured topologies, cliques, and related unrealistic
models.
Chapter 1
Introduction 


This  thesis  involves  the  study  of  propagation  processes  on  large  graphs.  Will  a  specific 
YouTube  video  go  viral?  Given  a  who-contacts-whom  network  and  a  virus  propagation 
model,  can  we  predict  whether  there  will  be  an  epidemic?  Which  are  the  best  nodes 
(people,  computers  etc.)  to  immunize,  to  slow  down  and  prevent  an  epidemic  as  soon 
as  possible?  Such  problems  are  central  in  surprisingly  diverse  areas:  from  cyber  security, 
epidemiology  and  public  health,  product  marketing  to  information  dissemination.  Answering 
these  questions  involves  the  study  of  aggregated  dynamics  over  complex  connectivity 
patterns.  The  proliferation  of  Internet  and  Web  2.0  and  social  networks  like  Facebook, 
Twitter,  Flickr  etc.  coupled  with  more  biological  data  and  simulations  has  afforded  the 
opportunity to study dynamical processes on a scale unimaginable before. Understanding
such processes will eventually enable us to manipulate them for our benefit, e.g., a
better  understanding  of  the  dynamics  of  epidemic  spreading  over  graphs  allows  us  to 
devise  more  robust  policies  for  immunization.  Social-network  websites  like  Facebook 
count  more  than  900  Million  users  and  1  Billion  US  Dollars  in  revenue.  Hospital-acquired 
infections  take  more  than  99  thousand  lives  and  cost  more  than  5  Billion  US  Dollars  per 
year. The societal impact of networked collaboration during political events like the 'Arab
Spring' has also been well-documented. Hence research in this area, helping us answer
questions  like  how  information  spreads  through  social  media,  and  how  to  distribute  a 
given  amount  of  resource  like  antibiotics  across  hospitals,  holds  great  scientific,  social  as 
well  as  commercial  value. 

This  thesis  gives  new  theory,  better  algorithms  and  improved  models  for  propagation 
processes  on  large  real-world  networks.  The  next  section  gives  an  overview  of  the  thesis, 
after  which  we  present  a  summary  of  the  major  contributions. 


1.1  Motivation  and  Overview 

Graphs (also known as networks) are powerful tools for modeling processes and situations
of interest in real life, like social systems, cyber security, epidemiology, biology
etc. They are ubiquitous, from online social networks and gene-regulatory networks to
router graphs. They effectively model a wide range of phenomena, as they expose
local-dependencies  and  capture  large-scale  structure  at  the  same  time.  For  example, 
online  social  networks  have  become  essential  for  marketing,  collaborative  action;  gene- 
regulatory  networks  help  in  understanding  the  inner  workings  of  the  cell;  modeling 
traffic on AS-router graphs helps us design a more efficient Internet. In addition, dynamical
processes¹ over them can give rise to astonishing macroscopic behavior, leading to
challenging  and  exciting  research  problems.  How  do  contagions  spread  in  population 
networks?  Which  group  should  we  market  to,  for  maximizing  product  penetration?  How 
stable  is  a  predator-prey  ecosystem,  given  intricate  food  webs?  How  do  rumors  spread 
on  Twitter/Facebook?  Questions  such  as  how  blackouts  can  spread  on  a  nationwide 
scale,  or  how  social  systems  evolve  on  the  basis  of  individual  interactions,  are  all  also 
related  to  dynamical  phenomena  on  networks.  'Big-Data'  is  a  natural  and  necessary  part 
of  research  in  this  sphere.  Although  the  actions  of  a  particular  individual  or  component 
may be too difficult to model, data mining and machine learning can be applied to
large groups or ensembles, which can yield effective models with the ability to predict
future events. For example, modeling the response of every individual to a marketing
strategy might be too difficult, but modeling the behavior of large groups of people
based on demographics and geography is feasible. Hence, we can try to answer even
more complex questions using these models, e.g., how should we distribute resources to
control an epidemic? Moreover, these policies have to be designed in a way that they can be
implemented on an extremely large scale.

This thesis tackles several natural and fundamental problems in epidemic-style propagation
(like, say, 'word-of-mouth' viral marketing) on large real graphs. In addition, due
to the sheer reach of the problems, an inter-disciplinary approach is vital here: this thesis
involves work with collaborators spanning social media, medicine and public health,
mobile and internet networking, and online services and operations. Our research has
been inherently multi-pronged, combining:

I  (Theory)  Analyzing  theoretical  models  of  propagation  processes 
II  (Algorithms)  Developing  scalable  algorithms  to  manage  the  processes 
III  (Models)  Using  massive  datasets  to  make  better  models 
Thus  this  thesis  essentially  breaks  down  into  three  parts,  each  of  which  we  summarize 
in  the  next  few  subsections.  We  want  to  highlight  that  these  parts  are  all  symbiotic  and 
closely related. For example, once we collect and learn models of a disease from real
data, we can predict its tipping point (through analysis) and subsequently leverage it for
immunization  (increase  the  tipping  point  as  much  as  possible). 

1.1.1  Thesis  Statement 

Substantially better algorithms can be designed for cascade management and immunization
through careful and novel analysis of virus propagation models.


¹ Intuitively, processes where the state (or action) of an agent depends on the states (actions) of its neighbors.
More specifically, we systematically analyze fundamental epidemic models in a variety
of situations (static/dynamic graphs, single/multiple viruses) to better understand the
effect of graph topology on various propagation processes (like epidemic thresholds)
and subsequently leverage our results to carefully design better algorithms for many
tasks like immunization (like SMART-ALLOC). Finally, using our insights, we build better
models for describing propagation scenarios (like SPIKEM) to match real data.

1.1.2  [Part  I]  Theory:  Chapters  2,  3,  4,  5 

This  part  of  the  thesis  is  devoted  to  gaining  a  deeper  understanding  of  abstract  epidemic 
models.  Models  help  us  abstract  out  the  process  and  allow  us  to  reason  more  generally 
about them. In this part, we tackle important questions like understanding the tipping
point behavior of epidemics and predicting who-wins among competing viruses/products,
which have immediate and broad applications, like selecting targets for advertising
and marketing and selecting people to inoculate to stop an epidemic. In contrast to
previous  work,  our  analysis  focuses  on  arbitrary  underlying  graphs,  leading  to  more 
readily  applicable  results. 

Chapter 2 (Epidemic thresholds for static graphs): The main question we answer is:
will there be an epidemic, given the graph and the virus propagation model? We
show (see Theorem 2.1) that the threshold condition is ($\lambda_1$ is the first eigenvalue of the
connectivity matrix, $C$ is a virus-model dependent constant):

$$\lambda_1 \cdot C < 1,$$

for (a) any graph; and (b) all propagation models in standard literature (more than 25
from canonical texts, including the AIDS virus H.I.V.). Our result de-couples the effect of
the  topology  and  the  virus  model,  and  also  unifies  and  subsumes  older  results,  which 
mostly  focused  on  special  graphs  or  virus  models.  We  are  the  first  to  show  the  epidemic 
threshold  on  arbitrary  graphs  and  almost  any  virus  propagation  model.  Our  discovery 
has broad implications and applications like faster epidemiological simulations and blog
cascades (for example, the award-winning Independent Cascade model is a special case of our
generalization).
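
To make the condition concrete, the following minimal Python sketch checks $\lambda_1 \cdot C < 1$ on a toy graph. It assumes an undirected graph (symmetric adjacency matrix) and, purely for illustration, the SIS model with $C = \beta/\delta$ (infection rate over recovery rate). The function names, the toy star graph and the parameter values are hypothetical and are not taken from the thesis.

```python
import numpy as np


def leading_eigenvalue(adj):
    """Largest eigenvalue of a symmetric adjacency matrix."""
    # eigvalsh returns eigenvalues in ascending order for symmetric input.
    return float(np.linalg.eigvalsh(adj)[-1])


def below_threshold(adj, c):
    """True if lambda_1 * C < 1, i.e. the contagion is predicted to die out."""
    return leading_eigenvalue(adj) * c < 1.0


# Toy example (assumption): a star with one hub and three leaves.
star = np.array(
    [
        [0, 1, 1, 1],
        [1, 0, 0, 0],
        [1, 0, 0, 0],
        [1, 0, 0, 0],
    ],
    dtype=float,
)

# Illustrative SIS constants C = beta / delta (assumed values).
weak_virus = 0.1 / 0.4    # C = 0.25
strong_virus = 0.3 / 0.4  # C = 0.75

print("lambda_1:", leading_eigenvalue(star))
print("weak virus dies out:", below_threshold(star, weak_virus))
print("strong virus dies out:", below_threshold(star, strong_virus))
```

The sketch keeps the two factors separate on purpose: the graph enters only through `leading_eigenvalue`, and the virus model only through the constant `c`, mirroring the de-coupling described above.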

Chapter 3 (Epidemic thresholds for dynamic graphs): Social contacts are not constant,
and therefore it is more realistic to have graphs which change with time (say, day vs. night
connectivity). While static graphs have been studied for a long time, with numerous
analytical results, propagation models on time-evolving networks are so hard to analyze
that most existing works are simulation studies. We show that the epidemic
threshold of the "flu-like" SIS model on any set of time-varying, arbitrary graphs depends
only on the largest eigenvalue of a so-called 'system' matrix (see Theorem 3.1).
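
The exact form of the system matrix is given in Theorem 3.1 and is not reproduced in this overview. As a hedged sketch, assume that for alternating adjacency matrices $A_1, \dots, A_T$ and SIS parameters $\beta$ (infection) and $\delta$ (recovery), the system matrix is the product of the per-step matrices $(1-\delta)I + \beta A_i$, and that the contagion dies out when its largest eigenvalue (in magnitude) is below 1. This form, the function names and the toy day/night graphs below are assumptions made for illustration only.

```python
import numpy as np


def system_matrix(adjs, beta, delta):
    """Product of per-step matrices (1 - delta) * I + beta * A_i (assumed form)."""
    n = adjs[0].shape[0]
    product = np.eye(n)
    for adj in adjs:
        product = product @ ((1.0 - delta) * np.eye(n) + beta * adj)
    return product


def spectral_radius(matrix):
    """Largest eigenvalue magnitude of a square matrix."""
    return float(np.max(np.abs(np.linalg.eigvals(matrix))))


# Toy alternating behavior (assumption): two people meet by day, not by night.
day = np.array([[0.0, 1.0], [1.0, 0.0]])
night = np.zeros((2, 2))

strong = spectral_radius(system_matrix([day, night], beta=0.5, delta=0.2))
weak = spectral_radius(system_matrix([day, night], beta=0.3, delta=0.2))

print("strong virus:", strong, "-> epidemic" if strong >= 1 else "-> dies out")
print("weak virus:", weak, "-> epidemic" if weak >= 1 else "-> dies out")
```

For this two-node example the value can be checked by hand: the product equals $(1-\delta)\,[(1-\delta)I + \beta A_{day}]$, whose largest eigenvalue is $(1-\delta)(1-\delta+\beta)$.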

[Figure 1.1 panels: (a) SIRS Infected (log-log); (b) SIRS Threshold]

Figure 1.1: The tipping-point exactly matches our prediction: simulation results on a massive
social-contact graph PORTLAND (31 mil. edges, 1.5 mil. nodes) and the SIRS model (temporary
immunity like pertussis). (a) Plot of Infected Fraction of Population vs Time (log-log). Note the
qualitative difference in behavior under (green) the threshold and above (red) the threshold. (b)
Footprint (expected final epidemic size) vs Effective Strength (lin-log). Notice our prediction is
exactly at the take-off point.


Chapter  4  (Mutually  exclusive  competing  viruses):  In  this  chapter,  we  shift  our  focus 
to competing viruses spreading over networks. Given two competing products such as
iPhone/Android, and 'word of mouth' adoption of them, what will happen in the end?
Will they split the market, in proportion to their quality? That is, which product will 'win', in
terms of highest market share? This question is of interest in numerous other settings too,
e.g.,  the  common  flu  versus  avian  flu,  competing  memes,  theories  and  so  on.  One  may 
naively  expect  that  the  better  product  (stronger  virus)  will  just  have  a  larger  footprint, 
proportional  to  the  quality  ratio  of  the  products  (or  strength  ratio  of  the  viruses).  We 
prove  the  surprising  result  (see  Theorem  4.1)  that,  under  realistic  conditions,  for  any 
graph,  the  stronger  virus  completely  wipes-out  the  weaker  virus  ('winner-takes-all').  We 
demonstrate  it  through  case-studies  using  real  data  too. 


Chapter 5 (Co-existence with competing viruses): Following from the previous chapter,
a natural question is what happens when the competing viruses are
not mutually exclusive. For example, using one web-browser (say I.E.) does not
automatically imply not using the other (say Chrome). We show (see Theorem 5.1) that
there is a phase-transition: if the competition is harsher than a critical threshold, then
'winner-takes-all'; otherwise, the weaker virus survives. Our contributions [BPRF12]
include the problem definition (which is unique even in epidemiological literature), and
experiments on real data demonstrating our result.
1.1.3 [Part II] Algorithms: Chapters 6, 7, 8, 9

This part of the thesis is devoted to developing fast and effective algorithms for a variety
of tasks w.r.t. propagation: immunization, edge-placement and finding culprits of epidemics.
Such problems naturally arise in epidemiology ('vaccination programs'), social
media ('detecting rumor sources') and cyber security ('designing worms'). Interestingly,
our previous work on thresholds in various settings above gives a clear guideline for
controlling (harmful viruses) or speeding-up (product marketing) propagation via network
manipulation: minimize (or maximize) the leading eigenvalue of a suitable matrix
(adjacency matrix in case of static graphs; the so-called system matrix in case of dynamic
graphs). This is in contrast to previous work, where complex optimization functions
were used. Unfortunately, we also prove that our problems are computationally hard
(NP-complete). We exploit the task-specific structure to get fast (linear-time in edges and
the budget) and accurate algorithms, substantially improving the state-of-the-art.

Chapter 6 (Immunization as node-removal): Given a large network, like a computer
communication network, which k nodes should we remove (or monitor, or immunize),
to make it as robust as possible against a computer virus attack? Making careful approximations,
we exploit the submodular structure of the set of possible solutions, getting
a provably near-optimal algorithm NETSHIELD (see Algorithm 1), which outperforms
many methods by more than 7 orders of magnitude in running time, and competitors
like the well-known acquaintance immunization in quality of solutions. A similar question
arises in case of viral propagation over time-varying dynamic graphs. We develop
fast heuristics for complete immunization in case of time-varying graphs as well and
demonstrate their effectiveness (see Section 6.8).
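
The objective behind node-removal immunization (pick k nodes whose removal lowers the leading eigenvalue as much as possible) can be illustrated with a brute-force greedy baseline. The sketch below is not NETSHIELD: it recomputes the leading eigenvalue for every candidate node at every step, which is far too slow for large graphs, and it carries no approximation guarantee. The function names and toy graphs are hypothetical, and an undirected graph is assumed.

```python
import numpy as np


def leading_eigenvalue(adj):
    """Largest eigenvalue of a symmetric adjacency matrix."""
    return float(np.linalg.eigvalsh(adj)[-1])


def isolate(adj, node):
    """Copy of adj with every edge touching `node` removed."""
    out = adj.copy()
    out[node, :] = 0.0
    out[:, node] = 0.0
    return out


def greedy_immunize(adj, k):
    """Greedily pick k nodes; each pick minimizes the remaining lambda_1."""
    current = adj.astype(float)  # astype returns a copy, so adj is untouched
    chosen = []
    for _ in range(k):
        candidates = [v for v in range(current.shape[0]) if v not in chosen]
        best = min(
            candidates,
            key=lambda v: leading_eigenvalue(isolate(current, v)),
        )
        current = isolate(current, best)
        chosen.append(best)
    return chosen, leading_eigenvalue(current)


# Toy graphs (assumptions): a star with hub 0, and a path 0-1-2-3-4.
star = np.zeros((4, 4))
star[0, 1:] = 1.0
star[1:, 0] = 1.0

path = np.zeros((5, 5))
for i in range(4):
    path[i, i + 1] = path[i + 1, i] = 1.0

print(greedy_immunize(star, 1))
print(greedy_immunize(path, 1))
```

On the star, the hub is the natural choice; on the path, the middle node splits the graph into the two smallest pieces. Each greedy step here costs one eigen-decomposition per candidate node, which is the kind of expense that motivates faster, structure-exploiting algorithms.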

Chapter 7 (Fractional Immunization): Given a fixed amount of medicines with partial
impact, how should they be distributed? Collaborating with domain experts at UMichigan,
we study controlling the spread of bacteria between hospitals through patient
transfers, by distributing scarce infection-control resources (which only have partial
impact on any hospital). We show that the problem is NP-complete, and develop Smart-Alloc, a near-optimal
and fast algorithm. Smart-Alloc runs in seconds on commodity hardware, as opposed
to weeks required for other approaches. Most importantly, when applied on real hospital
patient-transfer networks like US-MEDICARE, it results in 6 times fewer infections.
Figure 1.2 illustrates our results (each resource roughly halved the susceptibility of a
hospital and the same amount (200) was distributed).

Chapter  8  (General  Edge-placement):  Which  people  should  be  introduced  to  each 
other,  to  maximize  the  spread  of  a  crucial  piece  of  information?  Which  people  should 
be  'un-friended'  to  contain  dissemination  of  malware  over  Facebook?  We  study  the 
edge-placement  problem:  which  edges  should  we  add  or  delete  in  order  to  speed-up 
or contain a dissemination? For addition of edges, things are even more challenging
because of its intrinsic quadratic time complexity. We propose effective and near linear-time
algorithms to solve these problems. We also study the two problems and our methods
theoretically, including the accuracy and complexity of our methods, and the equivalence between
different strategies (edge vs. node-deletion). To the best of our knowledge, we are the
first to study the edge-placement problem.

[Figure 1.2 panels: (a) US-MEDICARE Inter-hospital Connections; (b) Infected Hospitals (in red) under current practice; (c) Infected Hospitals (in red) under our SMART-ALLOC]

Figure 1.2: Our proposed Smart-Alloc method has 6x fewer infections (red circles): (a) The
US-MEDICARE network of hospitals overlaid on a map. (b) Infected hospitals after a year (365
days) under current practice. (c) Similarly, under SMART-ALLOC. The current practice allocates
equal amounts of resource to each hospital.

Chapter 9 (Finding culprits of epidemics): Can we identify sources of rumors on Twitter?
Or given a snapshot of a large graph, in which an infection has been spreading for
some  time,  can  we  reliably  identify  those  nodes  (both  in  number  and  identity)  from  which 
the  infection  started  to  spread?  In  this  chapter  we  answer  this  question  affirmatively, 
and  give  an  efficient  method  called  NetSleuth  for  the  well-known  Susceptible-Infected 
virus  propagation  model,  which  automatically  finds  out  both  the  number  and  identity  of 
the  seeds  which  best-describe  the  epidemic.  Experimentation  on  our  method  [PVF12] 
shows  high  accuracy  in  the  detection  of  seed  nodes,  in  addition  to  the  correct  automatic 
identification  of  their  number.  Moreover,  we  show  NetSleuth  scales  linearly  in  the 
number  of  nodes  of  the  graph,  in  contrast  to  existing  methods. 

1.1.4  [Part  III]  Models:  Chapters  10, 11 

In  this  part,  we  study  numerous  real-datasets  to  build  better  models  in  domains  such 
as  propagation  of  memes  in  online  media  and  competing  tasks  in  everyday  life.  We 
also show, as a bonus, how to use such models for varied challenging applications like
forecasting activity trends and spotting outliers like telemarketers.

Chapter 10 (Rise and fall patterns): While models in epidemiology have been widely
studied and accepted, the models describing exactly how information diffuses in online
media are uncertain. Here we ask a few very simple questions: How quickly does a piece of
news spread over these media? How does its popularity diminish over time? Does the
rising and falling pattern follow a simple universal law? We propose SpikeM [MSP+12], a
concise yet flexible model, which generalizes and unifies previous models and observations,
and excels at challenging tasks like forecasting, spotting anomalies etc. We show the power
of SPIKEM through the analysis of more than 7.2GB of real data.


Figure  1.3:  SpikeM  at  work:  Results  of  "what-if"  forecasting  for  the  Harry  Potter  series.  We 
trained  parameters  by  using  (a)  the  first  spike  around  July  15,  2009  (black  solid  line),  and  (b) 
access  volume  two  months  before  the  release  (blue  lines  with  double  arrows  around  time  u  =  250, 
280)  and  then,  forecasted  the  following  two  spikes  (red  lines). 


Chapter  11  (Competing  Tasks):  If  Alice  has  double  the  friends  of  Bob,  will  she  also 
have  double  the  phone-calls  (or  wall-postings,  or  tweets)?  Analyzing  data  containing 
millions  of  users,  we  show  that  the  answer  to  the  question  is  a  power-law:  sub-linear,  or 
super-linear,  for  a  wide  variety  of  diverse  settings:  tasks  in  a  phone-call  network,  like 
count  of  friends,  count  of  phone-calls,  total  count  of  minutes;  tasks  in  a  twitter-like 
network,  like  count  of  tweets,  count  of  followees  etc.  Additionally,  based  on  competing 
species  theories,  we  give  a  simple  "competing  tasks"  model  (CTM),  that  leads  exactly 
to  power-law  relationships  between  task-frequencies,  and  show  how  to  use  it  to  spot 
telemarketers. 
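
A power-law relationship $y = c \cdot x^{a}$ becomes a straight line in log-log space, so its exponent can be estimated with an ordinary least-squares line fit. The sketch below illustrates this on synthetic data; it is not the CTM model, and the function name, the data and the exponent 1.5 are made up for illustration.

```python
import numpy as np


def fit_power_law(x, y):
    """Fit y = scale * x**exponent by least squares in log-log space."""
    slope, intercept = np.polyfit(np.log(x), np.log(y), 1)
    return float(slope), float(np.exp(intercept))


# Synthetic, noise-free data (assumption): calls grow super-linearly in friends.
friends = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
calls = 3.0 * friends ** 1.5

exponent, scale = fit_power_law(friends, calls)
print("exponent:", exponent, "scale:", scale)

# Under the fitted law, doubling the friends multiplies calls by 2**exponent.
print("doubling factor:", 2.0 ** exponent)
```

An exponent above 1 corresponds to the super-linear case (double the friends, more than double the calls), and an exponent below 1 to the sub-linear case. Users who lie far from the fitted line are natural candidates for outlier inspection.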


1.2  Contributions  and  Impact 

We  give  the  major  contributions  and  impact  of  the  work  in  this  thesis  next.  We  are 
arguably  the  first  to  present  a  systematic  study  of  propagation  and  immunization  of 
single  as  well  as  multiple  viruses  on  arbitrary,  real  and  time-varying  networks  as  the  vast 
majority of the literature focuses on structured topologies, cliques, and related unrealistic
models.

Theory 

• Eigenvalues for Epidemic Threshold: We are the first to show the epidemic threshold
on arbitrary graphs and almost any virus propagation model. Our eigenvalue
result  generalizes  and  unifies  previous  results  and  has  broad  implications  and 
applications  like  faster  epidemiological  simulations.  We  also  derive  the  first  closed 
formula  for  any  set  of  arbitrary  time-varying  graphs.  In  contrast,  most  past  work 
has  used  simulations. 

• Winner-Takes-All for Competing Viruses: We are the first to prove, for arbitrary
graphs, that the winner takes all among competing viruses/products. Additionally, we
extend this problem to mutually-interacting viruses to show a phase-transition.

• Impact: Our paper on epidemic thresholds on static graphs [PCF+11] was selected
as one of the best papers of the conference. Our results have been used to enable
other  important  tasks  like  anomaly  detection  and  graph  modeling  [AMF10],  and 
immunization  (see  Part  II  of  this  thesis).  They  are  also  being  incorporated  into 
FRED,  an  epidemiological  simulator  developed  by  MIDAS. 

Algorithms 

•  Dramatically  better  Immunization:  Our  algorithms  such  as  NETSHIELD  and 
SMART-ALLOC solve the complete and fractional immunization problems respectively,
achieving significant savings. NETSHIELD outperformed many methods by
more than 7 orders of magnitude in running time, and competitors like the well-known
acquaintance immunization in quality of solutions. Additionally, on real
hospital  patient-transfer  networks  like  US-MEDICARE,  Smart-Alloc  achieves 
up  to  6x  fewer  infections  and  30,000x  speed-up,  over  current  practice  and  ad-hoc 
heuristics.  In  contrast,  the  current  practice  in  control  of  highly  resistant  organisms 
via  patient  transfers  has  been  largely  focused  within  individual  hospitals. 

• Parameter-free Culprits detection: Our algorithm NETSLEUTH is the first linear-time
algorithm (in edges and nodes) both for identifying the set of culprits which
best describes a given snapshot of the epidemic and for automatically selecting the
best number of seed nodes, in contrast to the state of the art (methods which are at least
quadratic, and require the number of seeds as input).

•  Impact:  Our  results  and  algorithms  have  been  incorporated  into  undergraduate 
courses  (UPitt  Summer  Program)  and  slides  sought-after  for  graduate  courses 
(Xifeng  Yan,  UCSB)  in  universities,  and  have  appeared  in  ACM  Crossroads. 

Models 

• Unifying models for online diffusion: We develop SpikeM, a powerful model
to explain the rise and fall patterns of information diffusion, which unifies and
includes earlier patterns and models, is succinct, matches behavior of numerous
real datasets and can be used to forecast, answer 'what-if' scenarios and even
reverse-engineer epidemics.

•  Explaining  power-laws  in  competing  tasks:  We  develop  CTM,  an  intuitive  model 
to  explain  the  prevalence  of  super-linear  relationships  between  the  frequencies  of 
various competing tasks observed in real datasets, and use it to spot outliers like
telemarketers.
Part I
Theory

Overview


This  part  deals  with  the  analysis  of  epidemic-like  models  on  networks.  We  answer  two 
main  questions  here: 

• Epidemic Thresholds: Given a network of who-contacts-whom or who-links-to-whom,
will a contagious virus (or product or meme) spread and 'take-over'
(cause an epidemic) or die-out quickly? What will change if nodes have partial,
temporary or permanent immunity? The epidemic threshold is the minimum level
of virulence to prevent a viral contagion from dying out quickly, and determining it
is a fundamental question in epidemiology and related areas. For static graphs,
we show that the threshold condition is ($\lambda_1$ is the first eigenvalue of the connectivity
matrix, $C$ is a virus-model dependent constant): $\lambda_1 \cdot C < 1$, for (a) any graph; and
(b) all propagation models in standard literature (more than 25 from canonical texts,
including the AIDS virus H.I.V.). Additionally, for any set of arbitrary time-varying
graphs, we show that the threshold depends only on the largest eigenvalue of a
so-called 'system' matrix.

• Competing Viruses: Given two competing products such as iPhone/Android, and
'word of mouth' adoption of them, what will happen in the end? Will they split
the market, in proportion to their quality? We prove the surprising result that,
under realistic conditions, for any graph, the stronger virus completely wipes-out
the weaker virus ('winner-takes-all'). Further, we extend this analysis for viruses
interacting  more  subtly  and  prove  the  existence  of  a  phase  transition  for  co-existence. 
We  demonstrate  these  results  through  case-studies  using  real  data  too. 

These  are  natural  and  fundamental  questions,  and  our  results  in  this  part  have  broad 
applications,  like  faster  epidemiological  simulations,  better  immunization  algorithms 
and  prediction  (some  of  which  we  will  see  also  in  the  subsequent  parts  of  this  thesis). 
Chapter 2

Epidemic  Thresholds:  Static  Graphs  and 
Arbitrary  Models 


Given  a  network  of  who-contacts-whom  or  who-links-to-whom,  will  a  contagious  virus 
(or  product  or  meme)  spread  and  'take-over'  (cause  an  epidemic)  or  die-out  quickly? 
What will change if nodes have partial, temporary or permanent immunity? The epidemic
threshold is the minimum level of virulence to prevent a viral contagion from
dying out quickly, and determining it is a fundamental question in epidemiology and
related areas. Networks with lower thresholds are susceptible to weak contagions. Conversely,
raising the threshold (e.g., through immunization) protects the network against
attacks. Most earlier work focuses either on special types of graphs or on specific
epidemiological/cascade models. In this chapter, we are the first to show the G2-threshold
(twice generalized) theorem, which nicely de-couples the effect of the topology and the
virus model. Our result unifies and includes older results as special cases, and shows
that the threshold depends on the first eigenvalue of the connectivity matrix, (a) for any
graph and (b) for all propagation models in standard literature (more than 25, including
H.I.V.).

Our  discovery  has  broad  implications  for  the  vulnerability  of  real,  complex  networks, 
and numerous applications, including viral marketing, blog dynamics, influence propagation,
easy answers to 'what-if' questions, and simplified design and evaluation of
immunization  policies.  We  also  demonstrate  our  result  using  extensive  simulations  on 
real  networks,  including  on  one  of  the  biggest  available  social-contact  graphs  containing 
more  than  31  million  interactions  among  more  than  1  million  people  representing  the  city 
of  Portland,  Oregon,  USA. 


2.1  Introduction 

Given a social or computer network, where the links represent who has the potential to
infect whom, what can we say about its epidemic threshold? That is, can we determine
whether a small infection can 'take-off' and create an epidemic? What will change if the
nodes have permanent, temporary or no immunity? Both the underlying contact-network
(or the population structure) and the particular cascade (propagation) model should
intuitively play an important role in the spread of contagions (viruses/memes/products).
Finding the epidemic threshold for an arbitrary network is an important and fundamental
question in epidemiology and related areas. For instance, Figure 2.1 shows
the simulation output after running the SIRS model (Susceptible-Infectious-Recovered-Susceptible,
which models diseases with temporary immunity like pertussis) on a large
contact-network for different values of the virulence of the virus (achieved by tuning
the parameters of the model). We can clearly see two different regimes: the fast die-out
green regime and the steady-state epidemic red regime. This chapter deals with finding
the condition which separates these two regimes in SIRS, as well as in all other virus
propagation models in standard literature [Het00, EK10], on arbitrary contact-networks.

[Figure 2.1 panel: SIRS Infected (log-log)]

Figure 2.1: Qualitatively different infection time-series curves (Fraction of Infected population vs
Time) for the SIRS model (temporary immunity, like pertussis) on a large contact-network. What
is the condition that separates the two regimes, red (epidemic) vs green (extinction)?

Much of the previous work focuses on either special types of graphs (typically cliques [KW93],
block-structure and hierarchical graphs [FIY84] and random power-law graphs [PSV01])
or on specific epidemiological models [CWW+08]. We unify older results in two orthogonal
directions, include them as special cases, and show:

• De-coupling: the threshold condition separates the effect of topology and the virus
model,

• Arbitrary Topology: the threshold depends on the first eigenvalue of the connectivity
matrix,

• Arbitrary VPM: the threshold depends on one constant that completely characterizes
the virus propagation model (VPM).

Our result has numerous applications and immediate implications (see § 2.8) including
easy answers to 'what-if' questions and simplified design and evaluation of immunization
policies. Moreover, a variety of dynamic processes on graphs are modeled like
epidemic spreading and hence our result applies to many of them. For example, the linear-cascade
model [KKT03] is essentially the SIR model (Susceptible-Infected-Recovered,
models chicken pox, see Figure 2.2 (left inset) for state diagram); also, so-called threshold
models (like Granovetter's model [Gra78]) in sociology are similar in reality to cascade
models [DW04]. In contrast to harmful viruses, the propagation of some contagions may
in fact be desirable, e.g., dissemination of a product or an idea in a network of individuals.
For  example,  the  Bass  model  [Bas69]  fits  product  adoption  data  using  parameters  for 
pricing  and  marketing  effects.  However  it  ignores  topology;  it  simply  assumes  that  all 
adopters  have  equal  probability  of  influencing  non-adopters.  Instead,  using  our  result,  a 
more  refined  picture  can  be  constructed  of  when  a  product  gains  massive  adoption  on  a 
social  network  (equivalent  to  an  "epidemic"). 

Several  VPMs  have  direct  applications  in  modeling  computer  and  email  viruses  [Kle07, 
HMM03].  In  these  cases,  more  so  than  the  biological  ones,  it  is  easier  to  get  the  entire 
underlying  network.  Hence  our  threshold  results  can  be  used  to  make  the  network  more 
robust  by  "immunizing"  a  few  carefully  chosen  computers  in  the  network  (like  installing 
a  firewall  on  them).  Another  application  is  the  efficient  spreading  of  software  patches 
over  a  computer  network.  The  patches  behave  like  computer  worms  [VGKG08]  and  can 
help  defend  against  other  malicious  worms.  Given  full  knowledge  of  the  router-network 
involved,  we  can  then  estimate  how  "infectious"  the  patch-worm  has  to  be  (say  by 
increasing  the  number  of  probes  for  possible  hosts  before  dying  out)  to  at  least  initiate  an 
"epidemic"  w.r.t.  the  patch.  Additionally,  we  can  help  determine  the  vulnerability  and 
consequently the cost of not patching parts of the network. Various epidemic models
have also been used to model blog cascades; with our result, these models can now be applied
to arbitrary graphs, e.g., to study the propagation of memes through blogs [LBK09].

The  rest  of  the  chapter  is  organized  as  follows:  we  first  give  the  related  work  in  §  2.2, 
then  formulate  the  problem  (§  2.3)  and  state  our  main  result  (§  2.4),  give  a  proof  roadmap 
and  example  (§  2.5)  and  then  show  simulation  experiments  (§  2.6)  to  demonstrate  the 
result.  We  discuss  the  broad  implications  and  many  applications  of  the  result  in  §  2.7 
and  §  2.8.  We  then  conclude  (§  2.9)  and  finally  give  a  detailed  proof  in  the  Appendix. 


2.2  Related  Work 

We  review  related  work  here,  which  can  be  categorized  into  three  parts:  epidemic 
thresholds,  information  diffusion  and  cyber-physical  infrastructures.  None  of  these 
works  generalize  in  two  directions:  for  arbitrary  propagation  models  and  arbitrary 
networks. 

2.2.1  Epidemic  Thresholds 

Canonical texts for epidemiology include [AM91, Het00]. The most widely-studied
epidemiological models include the so-called homogeneous models (for example, the SIR
model was introduced by McKendrick in the 1920s [McK25]), which assume that every
individual has equal contact to others in the population and that the rate of infection
is determined by the density of the infected population. Kephart and White [KW93]
were among the first to propose epidemiology-based models (the KW model) to analyze
the propagation of computer viruses on homogeneous networks. However, there is
overwhelming evidence that real networks, including social networks, router and AS
networks [FFF99] etc., follow a power-law structure instead. Pastor-Satorras and Vespignani
[PSV01] studied viral propagation for random power-law networks, and showed
low  or  non-existent  epidemic  thresholds,  meaning  that  even  an  agent  with  extremely 
low  infectivity  could  propagate  and  persist  in  the  network.  They  use  the  "mean-field" 
approach,  where  all  graphs  with  a  given  degree  distribution  are  considered  equal.  There 
is  no  particular  reason  why  all  such  graphs  should  behave  similarly  in  terms  of  viral 
propagation.  In  a  recent  work,  Castellano  and  Pastor-Satorras  [CPS10]  empirically  argue 
that  some  special  family  of  random  power-law  graphs  have  a  non-vanishing  threshold 
under  the  SIR  model  in  the  limit  of  infinite  size,  but  provide  no  theoretical  justification. 

Newman [New05a, New02] mapped the SIR model to a percolation problem on a network
and studied thresholds for multiple competing viruses on special random graphs.
Finally, Chakrabarti et al. [CWW+08] and Ganesh et al. [GMT05] gave the threshold for
the SIS model on arbitrary undirected networks. Hence, none of the earlier work focuses
on epidemic thresholds for arbitrary virus propagation models on arbitrary, real graphs.

2.2.2  Information  Diffusion 

There is a lot of research interest in studying dynamic processes on large graphs, including
(a) blogs and propagations [GGLNT04, KNRT03, KKT03, RD02], (b) information cascades
[BHW92, GLM01, Gra78, GRLK10, ZWF+11] and (c) marketing and product penetration
[Rog03, LAH06]. Competitive cascades have been studied in [PBS10, PBRF12].
Various optimization problems have also been studied on such processes, like influence
maximization [KKT03, CWW10, SKOM12] and finding effectors [LTGM10]. These
dynamic processes are all closely related to virus propagation, with many directly
based on epidemiological models [Bas69, KKT03], e.g., the award-winning linear-cascade
model [KKT03] is a special case of our model: specifically, it is essentially an SIR model
with δ = 1 and all our results carry through.

2.2.3  Cyber-physical  infrastructures 

Cascade models have also been applied to real-world networks to understand network
robustness in cyber-physical infrastructures, i.e., the ability of a network to continue
its function in light of failures. The exact nature of a network's "function" varies from
network to network and is typically determined during its design phase. Models
related to the spread of disease, similar to those modeled herein, have been used to
analyze the robustness of networks against node failures; for instance, see Chakrabarti et
al. [CLF+07]. Another such work is Buldyrev et al. [BPP+10], which explores cascading
failures in coupled power and data networks. Their model is based on a percolation
model common in statistical physics, and can be shown equivalent to the SIR model we
describe later in the chapter.

Table 2.1: Common Terminology

Term    Definition
----    ----------
VPM     virus-propagation model
NLDS    non-linear discrete-time dynamical system
β       attack/transmission probability over a contact-link
δ       healing probability once infected
γ       immunization-loss probability once recovered (in SIRS) or vigilant (in SIV, SEIV)
ε       virus-maturation probability once exposed; hence, 1 − ε is the virus-incubation probability
θ       direct-immunization probability when susceptible
A       adjacency matrix of the underlying undirected contact-network
N       number of nodes in the network
λ₁      largest (in magnitude) eigenvalue of A
s       effective strength of an epidemic model on a graph with adjacency matrix A

2.3  Problem  Formulation 

Table 2.1 and Table 2.2 list common terminology and describe some of the epidemic
models we will be using in this work. We use the terms 'cascade model' and 'virus
propagation model' interchangeably. We next state formally the problem we address in
this chapter:

Problem 2.1. Epidemic Threshold

Given: An undirected unweighted graph G, and a virus propagation model (VPM) and its
parameters (e.g., β and δ for SIR).

Find: A condition under which an infection will die out and not cause an epidemic on the
graph.


2.4  Results 

Table 2.2: Some Virus Propagation Models (VPMs)

Model   Description
-----   -----------
SIS     'susceptible, infected, susceptible' VPM - no immunity, like flu
SIR     'susceptible, infected, recovered' VPM - life-time immunity, like mumps
SIRS    VPM with temporary immunity
SIV     'susceptible, infected, vigilant' VPM - immunization/vigilance with temporary immunity
SEIR    'susceptible, exposed, infected, recovered' VPM - life-time immunity and virus incubation
SEIV    VPM with vigilance/immunization with temporary immunity and virus incubation

The epidemic threshold is usually defined as the minimum level of virulence to prevent
a viral contagion from dying out quickly [AM91, Het00, BBV10, Kle07]. In order to
standardize the discussion of threshold results, we express the threshold in terms of the
normalized effective strength, s, of a virus, which is a function of the particular propagation
model and the particular underlying contact-network. So we are 'above threshold' when
s > 1, 'under threshold' when s < 1 and the threshold or the tipping point is reached
when s = 1. The effective strength s can be thought of as the basic reproduction number
R₀ frequently used in epidemiology [Het00, AM91]. It (s) is then, very roughly, the "net"
generalized R₀ for the virus model and an arbitrary graph and is the quantity which
determines the tipping point of an infection over a contact-network. Our main result is:

Theorem 2.1 (G2-threshold theorem). For any virus propagation model (satisfying our general
initial assumptions; see Section 2.5 for details) operating on an arbitrary undirected graph with
adjacency matrix A and largest eigenvalue λ₁, the virus will get wiped out if:

$$ s < 1 \tag{2.1} $$

where s (the effective strength) is:

$$ s = \lambda_1 \cdot C_{VPM} \tag{2.2} $$

and C_VPM is an explicit constant dependent on the virus propagation model. Hence, the tipping
point is reached when s = 1.


Proof.  We  give  a  roadmap  in  the  next  section  and  a  detailed  proof  in  the  Appendix.  □ 


Firstly, note that our result separates out the effect of the network and the VPM.
Secondly, our result subsumes older results on (a) contact-networks, and (b) VPMs as
special cases. Results on contact-networks like cliques (everybody contacts everybody
else: λ₁ = N − 1, where N is the number of nodes in the graph), random Erdős-Rényi graphs
with expected degree d (λ₁ = d), 'homogeneous' graphs [KW93], power-law/scale-free
graphs [PSV01], structured hierarchical (near-block-diagonal) topologies [FIY84]
(people within a community contact all others in this community, with a few cross-community
contacts) etc. are special cases. Likewise, all standard virus propagation
models [Het00, EK10] are specific instantiations of the generalized model used in our
theorem (see Figure 2.2; more later).

Table 2.3 lists a few of our threshold expressions after applying our result on some
standard epidemic models. The popular models listed include SIS (no immunity, like flu,
Susceptible-Infected-Susceptible), SIR (permanent immunity, like mumps, Susceptible-Infected-Recovered),
SIRS (temporary immunity, like pertussis), SEIR (virus incubation
in addition to permanent immunity) etc. (note that models like SI inherently don't have
an epidemic threshold as all nodes will eventually get infected on any graph; hence our
work doesn't apply to them).

Table 2.3: Threshold Results for Some Models.

Models                  Effective Strength (s)                            Threshold (tipping point)
------                  ----------------------                            -------------------------
SIS, SIR, SIRS, SEIR    s = λ₁ · (β / δ)                                  s = 1
SIV, SEIV               s = λ₁ · (βγ / (δ(γ + θ)))                        s = 1
SI₁I₂V₁V₂ (~ H.I.V.)    s = λ₁ · (…) [constant illegible in the source]   s = 1

SIS (susceptible/infected/susceptible) has no immunity (like flu), SIR (susceptible/infected/recovered)
has permanent immunity (like mumps), SIRS has temporary immunity (like pertussis), while
SEIR (susceptible/exposed/infected/recovered) has additional virus incubation, and SI₁I₂V₁V₂ has
been used to model some H.I.V. infections [2]. SEIV and SIV are two useful generalizations. β
is the attack/transmission probability over a contact link, δ is the healing probability, γ is the
immunization-loss probability, (1 − ε) is the virus incubation probability and θ is the direct-immunization
probability when susceptible (see Figure 2.2). Our result is a general one and these
models just highlight its ready applicability to standard VPMs in use.
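As a quick worked instance of Theorem 2.1, the following sketch combines the SIS constant C_VPM = β/δ from Table 2.3 with the clique value λ₁ = N − 1 quoted earlier. The numbers N = 101 and δ = 0.5 are purely illustrative assumptions and do not come from the experiments in this chapter.

```latex
s \;=\; \lambda_1 \cdot C_{VPM} \;=\; (N-1)\,\frac{\beta}{\delta} \;<\; 1
\quad\Longleftrightarrow\quad
\beta \;<\; \frac{\delta}{N-1},
\qquad\text{e.g. } N = 101,\ \delta = 0.5 \;\Rightarrow\; \beta < 0.005 .
```

In words: on a clique the largest attack probability that still leads to die-out shrinks in proportion to 1/(N − 1).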

Table 2.3 also lists our SEIV model (Susceptible-Exposed-Infected-Vigilant) which
itself generalizes almost all models from [Het00] (SIS with ε = 1, γ = 1, θ = 0; SIR with
ε = 1, γ = 0, θ = 0; SIRS with ε = 1, θ = 0 and so on). Using our proof, we get that
the effective strength for SEIV is s = λ₁ · βγ / (δ(γ + θ)) (as before, the virus dies out if s < 1).

Note that this implies that increasing β (the attack probability) strengthens the virus. At
the same time, decreasing the healing probability δ also strengthens the virus. Finally,
decreasing θ (the direct immunization probability) and increasing γ (the immunization
loss probability) also make the virus stronger. All of these fit with intuition; in fact,
the usefulness of our result is partly in enabling us to see these complex effects on the
virus strength very clearly. We discuss some subtler implications later in Section 2.7. We
discuss our terminology, general model and proof sketch next.
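The effective-strength computation described above can be illustrated with a small, dependency-free Python sketch. It is not the code used for this chapter: the helper names `largest_eigenvalue` and `effective_strength_seiv` are hypothetical, the power iteration is just one convenient way to estimate λ₁ (any eigen-solver would do), and treating θ = 0 as the β/δ row of Table 2.3 is a simplifying assumption of the sketch.

```python
import math


def largest_eigenvalue(adj, iters=1000, tol=1e-12):
    """Estimate lambda_1 of a symmetric 0/1 adjacency matrix (list of lists).

    Power iteration is run on A + I rather than A, so that bipartite graphs
    (whose spectrum is symmetric around zero) still converge.
    """
    n = len(adj)
    v = [1.0 / math.sqrt(n)] * n  # positive unit start vector
    lam = 0.0
    for _ in range(iters):
        # w = (A + I) v, then normalize
        w = [v[i] + sum(adj[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
        # Rayleigh quotient v^T A v is the current estimate of lambda_1
        av = [sum(adj[i][j] * v[j] for j in range(n)) for i in range(n)]
        new_lam = sum(v[i] * av[i] for i in range(n))
        if abs(new_lam - lam) < tol:
            return new_lam
        lam = new_lam
    return lam


def effective_strength_seiv(adj, beta, delta, gamma, theta):
    """s = lambda_1 * beta * gamma / (delta * (gamma + theta)) for SEIV.

    Assumption of this sketch: with theta == 0 the gamma terms cancel and
    the expression reduces to lambda_1 * beta / delta (first row of Table 2.3).
    """
    if delta <= 0:
        raise ValueError("delta must be positive")
    lam1 = largest_eigenvalue(adj)
    if theta == 0:
        return lam1 * beta / delta
    return lam1 * beta * gamma / (delta * (gamma + theta))


# Illustrative use on a 5-node clique (lambda_1 = N - 1 = 4).
clique5 = [[0 if i == j else 1 for j in range(5)] for i in range(5)]
s = effective_strength_seiv(clique5, beta=0.1, delta=0.5, gamma=1.0, theta=0.0)
print(f"s = {s:.3f}, below threshold: {s < 1}")
```

Raising θ or lowering γ in the call above lowers s, which matches the qualitative discussion of the parameters.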


2.5  Proof  Overview 

We first construct a generalized model (S*I²V*: arbitrary number of susceptible and
vigilant states, two infectious states) that is powerful enough to generalize all the practical
VPMs (and more) and satisfies our very general assumptions, while still being
mathematically tractable (Figure 2.2). We then approximate our general model using a
discrete-time non-linear dynamical system and transform the tipping point question into
a stability problem of the dynamical system at an appropriate equilibrium point. We
give the overview and roadmap here. As mentioned before, the full proof can be found
in the Appendix.

2.5.1  Our  Terminology 

Note  that  any  VPM  has  some  states  and  the  choice  of  which  states  to  include  in  a  model 
depends  on  the  particular  contagion  characteristics.  Yet,  we  can  think  of  every  model  as 
having  states  essentially  in  any  of  the  following  fundamental  broad  classes: 

1.  Susceptible  Class:  Nodes  in  such  a  state  can  get  infected  by  any  neighboring  node 
(in  the  contact-network)  who  is  infectious. 

2. Infected Class: In a state of this class, the node is infectious in the sense that it is
capable of transmitting the infection to its neighbors. Note that each such state will
have a transmissibility parameter (e.g., β in the SIR model for the infectious state
I). Thus this class can include states with transmissibility parameter = 0, i.e., states that are
'exposed' but not infectious (e.g., the E state in the SEIR model is a state which is in
the Infected class in the sense that it can potentially cause infections but is not by
itself infectious).

3. Vigilant/Vaccinated Class: Nodes in any of the states in this class cannot get infected
nor can they potentially cause infections. States like R in SIR (the recovered/died
state where the node gets permanent immunity/dies and hence does not participate
in the epidemic further), M in MSIR (the passive immune state), etc. are
conceptually of the Vigilant type.


2.5.2  Our  General  Model 

Using our terminology above, we can now describe the generalized model we used in
Theorem 2.1: S*I²V* (arbitrary number of susceptible and vigilant states, two infectious
states). As our general characterization, S*I²V* is powerful enough to seamlessly capture
all the practical models (and more) like SIS, SIR, SIRS, SEIR, SEIRS, MSIR, MSEIR
etc. [Het00, EK10], including H.I.V. [AM91], while being tractable enough to yield simple
threshold equations. Figure 2.2 shows the state diagram under S*I²V* for a node in
the contact-network together with the assumptions on the transitions. The red curvy
arrow indicates an exogenous (graph-based) transition caused by infectious neighboring
nodes while all other transitions are endogenous, caused by the node itself with some
probability. We have shown only cross-class transitions and their types.

Figure 2.2: State diagram for a node in the graph in our generalized model S*I²V*; it is not a
simple Markov chain. There are three classes (types) of states: Susceptible (healthy but can
get infected), Infected (capable of transmission) and Vigilant (healthy and can't get infected).
Within-class transitions are not shown for clarity. The red curvy arrow indicates an exogenous, i.e.,
graph-based, transition affected only by the neighbors of the node; all other transitions are endogenous
(caused by the node itself with some probability at every time step). (Left Inset) Special case:
Transition diagram for the SIR (Susceptible-Infected-Recovered) model. (Right Inset) Another
special case: Transition diagram for the SEIV (E stands for exposed but not infectious) model.
SEIV itself generalizes almost all models from [Het00] (SIS with ε = 1, γ = 1, θ = 0; SIR with
ε = 1, γ = 0, θ = 0; SIRS with ε = 1, θ = 0 and so on).

We make two assumptions:

1. Infection through Neighbors: The only way to get infected is through your neighbors,
i.e., there is no path to a state in the Infected class from a state in the Susceptible
class composed solely of endogenous transitions.

2. Starting Infected State: For the few models that have more than one infectious state,
any exogenous (graph-based) transition always results in a transition from a state
in the Susceptible class to the I₁ state. Note that this assumption is trivially obeyed
for a vast majority of models (with only one infected state).

Figure 2.2 (Left Inset) shows the popular SIR model as an instantiation of our general
model S*I²V*. Also, Figure 2.2 (Right Inset) shows an instantiation in the form of our
SEIV model (Susceptible-Exposed-Infected-Vigilant). Figure 2.3 shows the generalization
hierarchy for some common epidemic models and our main generalization S*I²V*. The
brown colored nodes denote standard VPMs found in literature while the blue colored
nodes denote our generalizations. Each VPM is a generalization of all the models below
it, e.g., SIV is a generalization of SIRS, SIR and SIS.


2.5.3  Proof  Sketch 


We define the vector P_t such that it specifies the state of the system at time t; the exact
definition will differ from model to model but it effectively encodes the probability
of each node in the graph of being in any given state at time t. Suppose the virus-propagation
model has m states (s_1, s_2, ..., s_m) (e.g., m = 3 for the SIR model with states
s_1 = S, s_2 = I and s_3 = R) and it operates on a graph of N nodes. Consider then a column
vector P_t ∈ ℝ^{mN×1}, which captures the probability of each node being in any of the m states
at a given time t. Specifically:

$$ P_t = [P_{s_1,1,t},\, P_{s_1,2,t},\, \ldots,\, P_{s_1,N,t},\, P_{s_2,1,t},\, \ldots,\, P_{s_m,N,t}]^{\top} \tag{2.3} $$

where P_{s_i,j,t} is the probability that node j is in state s_i at time t. A Non-Linear Dynamical
System (NLDS) can be represented by P_{t+1} = g(P_t) where g is some non-linear function
operating on a vector. The function g in our case is large and complicated. The NLDS
equation essentially tracks the evolution of the vector P_t over time. An equilibrium point
(also called a fixed point) of the system is the state vector (i.e. some particular P) which
does not change. Thus at the equilibrium point P_{t+1} = P_t = x. Intuitively, the tipping
point  for  any  model  then  deals  with  analyzing  the  stability  of  the  corresponding  NLDS 
at  the  point  when  none  of  the  nodes  in  the  graph  are  infected,  because  otherwise  the 
infection  can  still  spread.  If  the  equilibrium  is  unstable,  a  small  "perturbation"  (physically 
in  the  form  of  a  few  initial  nodes  getting  infected)  will  push  the  system  further  away 
(which  physically  means  more  and  more  nodes  will  get  infected  leading  to  an  epidemic). 
But  if  the  equilibrium  is  stable,  the  system  will  try  to  come  back  to  the  fixed  point  without 
going  "too-far"  away,  in  effect,  "controlling  the  damage".  At  threshold,  the  tendencies 
to  go  further  away  and  come-back  will  be  the  same.  In  other  words,  the  equilibrium  is 
stable below the threshold and is neutral at the tipping point. From dynamical-system
literature, we know how to relate the stability of the system at the equilibrium point to
the spectrum of the Jacobian matrix at that point (i.e. ∇g(x)). We eventually reduced the
requirement on the eigenvalues of ∇g(x) for any virus propagation model to a simple
condition on the eigenvalue of the adjacency matrix. This condition translates into the
effective strength of the virus under the model. The reason we can reduce the condition
to one on the adjacency matrix is due to the special structure of the virus models, which
was captured by the S*I²V* model described before. See the Appendix for the full proof.

Figure 2.3: Virus propagation model hierarchy (actually, lattice) for some standard models
including SIRS (temporary immunity), SIV (vigilance, i.e., pro-active vaccination); SEIV (includes
the 'exposed but not infectious' state, and temporary vigilance); MSEIR (with the passive immune
state M); and our main generalization S*I²V*. The brown colored nodes denote standard VPMs
found in literature while the blue colored nodes denote our generalizations. Each VPM is a
generalization of all the models below it.
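To make the stability argument of the proof sketch concrete, here is a minimal Python illustration for the simplest case, SIS. The update rule in `sis_step` is an assumed mean-field form chosen for this sketch (it is not the general function g used in the proof), and all function names are hypothetical. Under this assumed map, the Jacobian at the all-healthy fixed point p = 0 is (1 − δ)I + βA, whose largest eigenvalue 1 − δ + βλ₁ is below 1 exactly when s = λ₁ · β/δ < 1.

```python
def sis_step(adj, p, beta, delta):
    """One step of an assumed mean-field SIS map on infection probabilities p."""
    n = len(p)
    nxt = []
    for i in range(n):
        # Probability that no infected neighbor transmits to node i this step.
        escape = 1.0
        for j in range(n):
            if adj[i][j]:
                escape *= 1.0 - beta * p[j]
        stays_infected = (1.0 - delta) * p[i]          # infected and not healed
        gets_infected = (1.0 - p[i]) * (1.0 - escape)  # healthy and attacked
        nxt.append(stays_infected + gets_infected)
    return nxt


def mean_infection_after(adj, beta, delta, steps=200, p0=0.01):
    """Iterate the map from a small perturbation of the all-healthy state."""
    p = [p0] * len(adj)
    for _ in range(steps):
        p = sis_step(adj, p, beta, delta)
    return sum(p) / len(p)


def jacobian_top_eigenvalue(lam1, beta, delta):
    """Largest eigenvalue of the Jacobian (1 - delta) I + beta A at p = 0."""
    return 1.0 - delta + beta * lam1


# 5-node clique, lambda_1 = 4: beta = 0.1 gives s = 0.8, beta = 0.2 gives s = 1.6.
clique5 = [[0 if i == j else 1 for j in range(5)] for i in range(5)]
for beta in (0.1, 0.2):
    rho = jacobian_top_eigenvalue(4.0, beta, 0.5)
    level = mean_infection_after(clique5, beta, 0.5)
    print(f"beta={beta}: Jacobian eigenvalue={rho:.2f}, mean infection={level:.4f}")
```

Below the threshold the perturbation decays back to the fixed point; above it the perturbation grows away from it, which is the behavior the stability argument predicts.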


2.6  Experiments 

We performed computer simulation experiments on two large network topologies to
demonstrate our result. All the different virus propagation models were implemented as
a  discrete  event  simulation  in  C++.  We  ran  each  simulation  for  1000  time  ticks  and  took 
the  average  of  100  runs.  Initially,  10  nodes  were  infected  with  the  virus  and  we  then  let 
the  propagation  take  over  according  to  the  particular  model.  The  datasets  we  used  were: 

1. AS-OREGON: This network represents the Internet's Autonomous System (AS)
connectivity derived from public data sets collected by the Oregon Route Views
project¹. It contains 15,420 links among 3,995 AS peers. The Oregon graph is
relevant to studying the robustness of router networks to worm attacks [LZZ+03].
More information can be found at http://topology.eecs.umich.edu/data.html.

2. PORTLAND: It is one of the biggest available physical contact graphs, representing
a synthetic population of the city of Portland, Oregon, USA [NDS07]. It is a social-contact
graph containing more than 31 mil. links (interactions) among about 1.6 mil.
nodes (people). The data set is based on detailed microscopic simulation-based
modeling and integration techniques and has been used in modeling studies on
smallpox outbreaks as well as policy making at the national level [EGAK+04].
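The simulation procedure outlined at the start of this section (a few initially infected nodes, a fixed budget of time ticks, averaging over repeated runs) can be sketched in a few lines. The sketch below is a simplified tick-based Python illustration for SIR only, not the C++ discrete event simulator used for the experiments; the function names, the synchronous update order and the toy path graph are all assumptions made for illustration. The value returned is the number of removed nodes at the end of a run, which is the SIR footprint used in the take-off plots.

```python
import random


def sir_footprint(neighbors, beta, delta, seeds, max_ticks=1000, rng=None):
    """Run one tick-based SIR simulation and return the number of removed nodes.

    neighbors: dict mapping each node to a list of its neighbors.
    beta: per-contact attack probability; delta: healing probability per tick.
    """
    rng = rng or random.Random()
    state = {v: "S" for v in neighbors}
    for v in seeds:
        state[v] = "I"
    for _ in range(max_ticks):
        infected = [v for v in neighbors if state[v] == "I"]
        if not infected:
            break  # the infection has died out
        newly_infected = set()
        for v in infected:
            for u in neighbors[v]:
                if state[u] == "S" and rng.random() < beta:
                    newly_infected.add(u)
        # Nodes infected at the start of the tick may heal; new infections
        # only become active from the next tick on.
        for v in infected:
            if rng.random() < delta:
                state[v] = "R"
        for u in newly_infected:
            state[u] = "I"
    return sum(1 for v in neighbors if state[v] == "R")


def average_footprint(neighbors, beta, delta, seeds, runs=100, seed=0):
    """Average the footprint over several independent runs."""
    rng = random.Random(seed)
    total = sum(
        sir_footprint(neighbors, beta, delta, seeds, rng=rng) for _ in range(runs)
    )
    return total / runs


# Toy example: a 5-node path with node 0 initially infected.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(average_footprint(path, beta=0.3, delta=0.5, seeds=[0]))
```

Sweeping β (and hence s) and plotting the averaged footprint against s is one way to produce a take-off plot of the kind discussed in this section.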

Figures 2.4 and 2.5 illustrate our result via simulation experiments on PORTLAND and
AS-OREGON respectively. Above threshold, note the steady-state behavior in SEIV and
the initial explosive phase and eventual decay in SIR and SEIR (because the number of
susceptible nodes decreases monotonically). Also note the initial 'flat' period in the time
plots above threshold for the models having the exposed (E) state, SEIR and SEIV.
This is due to the virus incubation period, because of which there is an initial delay in
the number of infected nodes. This then results in an initial 'silent' period after which the
epidemic takes off. As there is no such incubation period in SIR and SIRS, their plots do
not show such silent periods.

In contrast, under threshold, the number of infections aggressively goes down to
zero in all the models. In addition, as our result predicts, the precise point when the
footprint of infection suddenly jumps in all models is at s = 1. The footprint measures the
extent of infection: For models with a steady-state behavior (SIS/SIRS), it is defined as
the maximum number of infections at any instant till we reach steady state. For models
with a monotonic decrease of susceptibles (and hence without a steady state, SIR/SEIR),
the footprint is the final number of cured/removed nodes from the network at the end of the
infection. Figures 2.4 (b, d, f, h) and 2.5 (b, d, f, h) also demonstrate the simplicity and power
of our result: the only variable we need for determining the epidemic threshold of the
whole system consisting of multiple parameters is the effective strength (s = λ₁ · C_VPM),
nothing else.

¹ The University of Oregon Route Views Project, http://www.routeviews.org

Figure 2.4: Simulation Results on the PORTLAND graph, all values averaged over 100 runs.
(a), (c), (e), (g) Plot of Infective Fraction of Population vs Time (log-log) for SIR, SEIR, SIRS and
SEIV models. Note the qualitative difference in behavior: two curves under (green) the threshold
and two curves above (red) the threshold. (b), (d), (f), (h) "Take-off" plots. Footprint (see Section 2.6)
vs Effective Strength (lin-log) for SIR, SEIR, SIRS and SEIV models. The tipping point exactly
matches our prediction (s = 1) in all cases.

Figure 2.5: Simulation Results on the AS-OREGON graph, all values averaged over 100 runs.
(a), (c), (e), (g) Plot of Infective Fraction of Population vs Time (log-log) for SIR, SEIR, SIRS and
SEIV models. Note the qualitative difference in behavior: two curves under (green) the threshold
and two curves above (red) the threshold. (b), (d), (f), (h) "Take-off" plots. Footprint (see Section 2.6)
vs Effective Strength (lin-log) for SIR, SEIR, SIRS and SEIV models. The tipping point exactly
matches our prediction (s = 1) in all cases.

Figure 2.6: Why λ₁ matters more than the number of edges E: changing connectivity and vulnerability
of graphs with changing λ₁. Panels, in order of increasing λ₁: (a) Chain (λ₁ = 1.73), (b) Star (λ₁ = 2),
(c) Clique (λ₁ = 4). The clique (largest λ₁) is the most vulnerable. Note that E is not
enough: star and chain have the same number of edges (E = 4) but the star is intuitively more
vulnerable, as our result also says (it has a higher λ₁).


2.7  Implications 

We  first  discuss  some  direct  implications  of  the  G2-threshold  theorem:  the  vulnerability 
of  graphs  to  epidemics  and  some  unexpected  results  in  specific  models. 

2.7.1 Vulnerability of Networks: focus on eigenvalues

What exactly does the result mean w.r.t. the graph? Intuitively, $\lambda_1$ (also known as the
spectral radius) of a graph captures the connectivity of the graph. The more connected the
graph is, the more vulnerable it is to an epidemic by a virus (see Figure 2.6). Our threshold
results suggest that an arbitrary graph behaves in the same way as a $\lambda_1$-regular graph
(both will have the same $\lambda_1$). The entire dynamics of the epidemic may not be captured
by $\lambda_1$ completely, but the threshold is solely dependent on $\lambda_1$ (apart from parameters of the
VPM). By making the relation between the graph and threshold explicit, our result has
many consequences for the vulnerability of real, complex networks as well. For example,
our result explains the observed vulnerability of 'small-world' networks [WS98]: their
$\lambda_1$ is relatively high compared to a regular graph with the same number of nodes and
edges, due to the presence of shortcuts. Also, previous results have shown that the
epidemic  threshold  for  the  SIS  model  in  case  of  random  scale-free  networks  like  the 

Figure 2.7: Counter-intuitive results: neither the incubation rate $\epsilon$ nor the immunity-loss rate $\gamma$ affects the
threshold. (a) 'Take-off' plot for the SEIR model (a special case of SEIV) on the PORTLAND graph
(lin-log scale). All three curves are on top of each other. (b) 'Take-off' plot for the SIRS model
(a special case of SEIV) on the PORTLAND graph (lin-log scale); higher means more infections
(increasing with the loss of immunization $\gamma$). Note that in both cases the parameter does not affect the
threshold (the tipping point is still at effective strength s = 1). All values are averages over 100
runs.


Internet  is  vanishingly  small  as  the  size  N  of  the  network  increases  [PSV01,  AJB00]. 
This is a corollary of our result: when a power-law graph grows ($N \to \infty$), the largest
eigenvalue grows with the maximum degree [CLV03], which also grows to infinity, and
thus the threshold approaches zero.

2.7.2  Counter-intuitive  Results 

Apart from the dependence of the threshold on $\lambda_1$, it is instructive to note unexpected
results in some specific models. The SEIR, SIRS and SEIV models serve well to demonstrate
the effect of virus-incubation and direct-immunization. See Figure 2.7. The threshold in
SEIR surprisingly does not depend on the virus-incubation probability: the parameter $\epsilon$,
in effect, only delays or speeds up the achievement of the threshold, not what the threshold
itself is. Similarly, the threshold in SIRS does not depend on $\gamma$. Also, from the threshold
equation of SEIV (Table 2.3), we can infer that lowering the rate of loss of immunity, i.e.
having a smaller $\gamma$ (say due to better hygiene), decreases the effective strength s (and
makes it harder for the virus to cause an epidemic) only so long as there is a mechanism
to give a node direct immunity, i.e. having a non-zero $\theta$ (say by using a vaccine), before an
infection (in the Susceptible state) instead of after (in the Recovered state). Satisfyingly,
this fits well with the old adage 'Prevention is better than Cure'.


2.8  Impact 

Our results can be fundamental to a wide range of applications. We mentioned broader
impact in § 2.1 before. Here we briefly discuss some immediate applications in epidemiology
and cyber-physical infrastructures.


2.8.1  Effective  Immunization 

Given the linear dependence of our threshold on $\lambda_1$, we can propose a simple immunization
goal. For any virus, remove (immunize) those nodes whose removal will decrease
the $\lambda_1$ value the most (so that the resultant infection falls below threshold and dies out),
e.g., immunize teachers and kindergarten children first to control the epidemic. A lot
of work targets immunizing high-degree nodes in scale-free networks [CHbA03] which,
while a good idea, is not optimal: just concentrating on high-degree nodes will miss
those low-degree nodes which are good "bridges" and can have an important influence
on decreasing $\lambda_1$ when immunized. For example, intuitively, the sole common friend
between two disparate yet internally well-connected groups (say, between scientists
and movie celebrities) can have a huge impact on the outbreak of a disease even if (s)he
knows only a few people in each community.
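
The immunization goal above can be illustrated with a brute-force sketch: for every node, remove it and measure how much $\lambda_1$ falls. This is only a toy illustration under stated assumptions, not the immunization algorithm of this thesis: it recomputes $\lambda_1$ from scratch for each candidate, which is far too slow for large graphs, and the names `largest_eigenvalue`, `remove_node` and `eigen_drops` as well as the two 5-node example graphs are hypothetical.

```python
import math

def largest_eigenvalue(adj, iters=500):
    """Estimate lambda_1 via power iteration on A + I (the shift handles bipartite graphs)."""
    n = len(adj)
    if n == 0:
        return 0.0
    v = [1.0] * n
    for _ in range(iters):
        w = [v[i] + sum(adj[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return sum(v[i] * adj[i][j] * v[j] for i in range(n) for j in range(n))

def remove_node(adj, k):
    """Return the adjacency matrix with row and column k deleted (node k immunized)."""
    n = len(adj)
    return [[adj[i][j] for j in range(n) if j != k] for i in range(n) if i != k]

def eigen_drops(adj):
    """Map each node to the decrease in lambda_1 caused by removing it."""
    base = largest_eigenvalue(adj)
    return {k: base - largest_eigenvalue(remove_node(adj, k)) for k in range(len(adj))}

def from_edges(n, edges):
    adj = [[0] * n for _ in range(n)]
    for a, b in edges:
        adj[a][b] = adj[b][a] = 1
    return adj

star = from_edges(5, [(0, 1), (0, 2), (0, 3), (0, 4)])    # node 0 is the hub
chain = from_edges(5, [(0, 1), (1, 2), (2, 3), (3, 4)])   # node 2 is the middle

for name, graph in [("star", star), ("chain", chain)]:
    drops = eigen_drops(graph)
    best = max(drops, key=drops.get)
    print(name, "-> immunize node", best, "drop in lambda_1:", round(drops[best], 3))
```

In the chain example, nodes 1, 2 and 3 all have degree 2, so a degree-based ranking cannot separate them, while the drop in $\lambda_1$ does.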


2.8.2  Evaluating  'What-if'  Scenarios 

Our result can also help quickly determine the outcome of plausible situations, e.g., is there
a danger of an epidemic if the virus is twice (or half) as infectious (virulent)? This can
then feed into policy decisions for controlling epidemics, like imposing restrictions on
travel so as to not increase $\lambda_1$. Policy makers can assume any graph model which
best captures the contact behavior of the population and still use our threshold result to
guide immunization policies.
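
A 'what-if' query of this kind reduces to recomputing the effective strength $s = \lambda_1 \cdot C_{VPM}$ and comparing it with 1. The sketch below assumes the SIS model, for which the model constant is the familiar ratio of infection to recovery probability, $\beta/\delta$; the function names and the numeric values ($\lambda_1 = 4$, $\beta = 0.1$, $\delta = 0.5$) are hypothetical and chosen only for illustration.

```python
def sis_constant(beta, delta):
    """Model constant for SIS (assumed here): infection prob. over recovery prob."""
    return beta / delta

def effective_strength(lambda_1, c_vpm):
    """s = lambda_1 * C_VPM; the tipping point is s = 1."""
    return lambda_1 * c_vpm

def what_if(lambda_1, beta, delta, factors=(0.5, 1.0, 2.0)):
    """For each infectivity scaling factor, return (s, above_threshold)."""
    report = {}
    for f in factors:
        s = effective_strength(lambda_1, sis_constant(f * beta, delta))
        report[f] = (s, s > 1.0)
    return report

# Hypothetical scenario: a contact graph with lambda_1 = 4.
report = what_if(lambda_1=4.0, beta=0.1, delta=0.5)
for factor, (s, epidemic) in sorted(report.items()):
    print(f"infectivity x{factor}: s = {s:.2f}, above threshold: {epidemic}")
```

No simulation is needed for the answer: only $\lambda_1$ and the model constant enter the comparison.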


2.8.3  Accelerating  Simulations 

Similarly, we can considerably simplify expensive epidemiological simulations as well.
For example, running a typical simulation with one set of parameters of a flu epidemic
on a population of size 33 million (roughly the size of the state of California) takes about 2 days on
a cluster of 50 machines [BBE+08]. Using our result, we can eliminate parameters which
do not affect the effective strength of the contagion and also quickly identify parameter
spaces where simulations would be useful (i.e. above threshold). The main task
in such testing is the eigenvalue computation. For this purpose, there are already
very efficient algorithms like Lanczos for sparse graphs which take 2-5 mins for networks
of millions. Moreover, structured topologies like cliques and block-diagonal matrices lend
themselves to even faster eigenvalue computations, making it very easy to apply our
result to real world simulations.

2.8.4 Applications to Computer Networking

As mentioned in Section 2.2, the epidemic threshold can be applied to a number of
network robustness scenarios in cyber-physical infrastructures. For example, the Kademlia
DHT [MM02] is used in a number of P2P networks, such as BitTorrent, to form a
decentralized P2P overlay and lookup table. In particular, Mainline BitTorrent [(Well]
implements a version of Kademlia as an alternative to the typical centralized tracker. When
a BitTorrent user queries the system for a particular torrent file, it is the DHT's
responsibility to return the torrent file (i.e., BitTorrent swarm bootstrap information). Hence,
using our result, the network overlay structure of the DHT (in particular, its eigenvalue)
may be used to evaluate the data replication necessary to guarantee the torrent file is
reachable.


2.9  Conclusion 

In  summary,  we  studied  the  problem  of  determining  the  epidemic  threshold  given  the 
virus  propagation  model  and  an  underlying  arbitrary  undirected  unweighted  graph. 
Intuitively,  the  answer  should  depend  both  on  the  graph  and  the  propagation  model. 
Earlier  results  have  focused  on  either  special  cases  of  graphs  or  special  models.  In  this 
chapter,  we  give  a  formula  for  the  epidemic  threshold  which  shows: 

1.  De-coupling:  The  effect  of  the  topology  and  the  propagation  model  on  the  threshold 
is  clearly  de-coupled, 

2. Arbitrary Topology: The effect of the undirected underlying topology is determined
only by $\lambda_1$ (the largest eigenvalue of the adjacency matrix),

3.  Arbitrary  VPM:  The  effect  of  the  virus  propagation  model  is  determined  by  a  model 
dependent  constant. 

Thus, all previous epidemic threshold results are specific instantiations of our G2-threshold
theorem. Our results can be used for forecasting and estimations in 'what-if'
scenarios, for control and manipulation of propagation and related dynamical processes
(immunization, marketing policies etc.). Moreover, our result can be easily extended to
handle even more elaborate settings such as (a) time-varying topologies (extending the
SIS-only results of [BBE+08, PTV+10]), and (b) multiple competing diseases (extending
the random power-law-graphs-only results of [New05a]).


APPENDIX 

We  give  the  full  proof  of  Theorem  2.1  here. 

2.A Notation


Recall that we are dealing with the $S^*I^2V^*$ generalized model: it has two states $I_1$ and $I_2$
in the Infected class. To simplify notation, we refer to state $I_1$ as E (the 'infection entrance
state') and $I_2$ as the I state in the proofs. The state E has a transmission probability of
$\beta_1$ and the state I has a transmission probability of $\beta_2$. The states E and I here should
be thought of as general infected states of our model and not in the sense of the
specific E and I states in epidemic models like SEIR, SEIV etc. We also refer to the exogenous
transitions as graph-based and endogenous transitions as internal interchangeably.
Table 2.4 gives some of the additional notation we will be using in our description of the
proof.


Table 2.4: Additional Notation and Symbols used in the proofs

Symbol                        Definition
$m$                           total number of states in the model
$q$                           total number of states in the Susceptible and Vigilant classes of the model; hence $m = q + 2$
$w$                           total number of states in the Susceptible class of the model
$S_1, S_2, \ldots, S_w$       general states in the Susceptible class
$E, I$                        general states in the Infected class
$\alpha_{KU}$                 probability (constant and given) of transition from state K to state U
$\beta_1$                     transmission probability for state E
$\beta_2$                     transmission probability for state I
$\zeta_{i,t}(E, I)$           probability that a node i does not receive any infections from E and I at time t
$x$                           the fixed point vector of our NLDS corresponding to when no node is in any of the Infected class states
$p_{S_y}$                     (same for each node) probability of being present in the $S_y$ state at $x$
$\mathcal{J}$                 Jacobian matrix of the NLDS computed at $x$


2.B  System  Equations 

We can develop the system equations, i.e. explicitly specify the non-linear function g for
the NLDS, based on the transition diagram of the model. As stated earlier in Section 2.5.2,
we assume that infections are received only from infected neighbors, i.e. those in states
E and I of the Infected class of states. Firstly, let's calculate the probability that a node i
does not receive any infections in the next time step (call it $\zeta_{i,t}(E, I)$; the arguments E, I denote that an
infection is passed only from a neighbor in the E or I states). No infections are transmitted
if:


• Either a neighbor is not in any of the infected states E and I

• Or it is in state E and the transmission fails with probability $1 - \beta_1$

• Or it is in state I and the transmission fails with probability $1 - \beta_2$

Since we assume infinitesimally small time steps ($\Delta t \to 0$), multiple events can be ignored
for first-order effects in the time step. Also, assuming the neighbors are independent, we
get:

$$\zeta_{i,t}(E, I) = \prod_{j \in \mathcal{NE}(i)} \Big( P_{E,j,t}(1 - \beta_1) + P_{I,j,t}(1 - \beta_2) + (1 - P_{E,j,t} - P_{I,j,t}) \Big) = \prod_{j \in \{1..N\}} \Big( 1 - A_{i,j} (\beta_1 P_{E,j,t} + \beta_2 P_{I,j,t}) \Big) \tag{2.4}$$

where $\mathcal{NE}(i)$ is the set of neighbors of node i in the graph.
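
As a concrete reading of Equation 2.4, the following sketch evaluates the no-infection probability for one node in both forms (product over neighbors, and product over all nodes weighted by the adjacency matrix). The function names, the 5-node star and the probability values are hypothetical and serve only as an illustration.

```python
def no_infection_prob(i, adj, p_e, p_i, beta1, beta2):
    """Second form of Eq. 2.4: product over all nodes j of 1 - A_ij (beta1 P_E,j + beta2 P_I,j)."""
    prob = 1.0
    for j, a_ij in enumerate(adj[i]):
        prob *= 1.0 - a_ij * (beta1 * p_e[j] + beta2 * p_i[j])
    return prob

def no_infection_prob_neighbors(i, adj, p_e, p_i, beta1, beta2):
    """First form of Eq. 2.4: product over neighbors only, one term per failure mode."""
    prob = 1.0
    for j, a_ij in enumerate(adj[i]):
        if a_ij:
            prob *= (p_e[j] * (1.0 - beta1)        # neighbor in E, transmission fails
                     + p_i[j] * (1.0 - beta2)      # neighbor in I, transmission fails
                     + (1.0 - p_e[j] - p_i[j]))    # neighbor not infected at all
    return prob

# Hypothetical example: hub (node 0) of a 5-node star; every leaf is in state I
# with probability 0.5 and never in state E.
star = [[0, 1, 1, 1, 1],
        [1, 0, 0, 0, 0],
        [1, 0, 0, 0, 0],
        [1, 0, 0, 0, 0],
        [1, 0, 0, 0, 0]]
p_e = [0.0] * 5
p_i = [0.0, 0.5, 0.5, 0.5, 0.5]

zeta_hub = no_infection_prob(0, star, p_e, p_i, beta1=0.1, beta2=0.2)
print(zeta_hub)
```

Each leaf contributes a factor $1 - 0.2 \cdot 0.5 = 0.9$, so the hub escapes infection with probability $0.9^4$.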

Also, the sum of probabilities of being in all the possible states for each node i should
equal 1. Hence,

$$\forall i, t: \quad \sum_{K} P_{K,i,t} = 1 \tag{2.5}$$

We can now write down the system equations as follows. A node i will be in any
particular state $S_y$ of the Susceptible class at time t + 1 if:

• Either it was in $S_y$ at time t and stayed in state $S_y$, i.e. it did not receive any
infections from its neighbors and it did not change state internally from $S_y$ to any
other state

• Or it was in some other state U and changed state internally from U to $S_y$

Hence, the probability of node i being in $S_y$, where $S_y$ is any state in the Susceptible
class, at time t + 1 is:

$$\forall y = 1, 2, \ldots, w: \quad P_{S_y,i,t+1} = \sum_{K \neq S_y} \alpha_{K S_y} P_{K,i,t} + P_{S_y,i,t} \Big( \zeta_{i,t}(E, I) - \sum_{K \neq E, S_y, I} \alpha_{S_y K} \Big) \tag{2.6}$$

Similarly, for the E state:

$$P_{E,i,t+1} = \sum_{K \neq S_1, S_2, \ldots, S_w} \alpha_{KE} P_{K,i,t} + \sum_{y=1}^{w} P_{S_y,i,t} \big( 1 - \zeta_{i,t}(E, I) \big) \tag{2.7}$$

and for any other state $U \notin \{S_1, S_2, \ldots, S_w, E\}$:

$$P_{U,i,t+1} = \sum_{K} \alpha_{KU} P_{K,i,t} \tag{2.8}$$


As discussed earlier (Equation 2.3), we can now define a probability vector $P_t$ by
"stacking" all these probabilities, which will completely describe the system at any time
t and evolve according to the above equations. Note that the above equations are
non-linear and naturally define the function g for the NLDS $P_{t+1} = g(P_t)$.
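
To make the NLDS concrete, the sketch below iterates the system equations for the simplest instantiation, assumed here to be SIS: one Susceptible state, the entrance state E playing the role of "infected" with transmission probability $\beta$, and an internal transition from E back to S with probability $\delta$. The mapping, the function names and the parameter values are assumptions made for illustration.

```python
def sis_step(p, adj, beta, delta):
    """One step of P_{t+1} = g(P_t) for SIS; p[i] is the infection probability of node i."""
    n = len(p)
    nxt = []
    for i in range(n):
        zeta = 1.0                                  # prob. of no incoming infection
        for j in range(n):
            zeta *= 1.0 - adj[i][j] * beta * p[j]
        # stay infected (no cure) or be susceptible and receive an infection
        nxt.append(p[i] * (1.0 - delta) + (1.0 - p[i]) * (1.0 - zeta))
    return nxt

def iterate(p0, adj, beta, delta, steps):
    p = list(p0)
    for _ in range(steps):
        p = sis_step(p, adj, beta, delta)
    return p

# 5-clique, so lambda_1 = 4; hypothetical parameters on either side of s = 4 * beta / delta = 1.
clique = [[0 if a == b else 1 for b in range(5)] for a in range(5)]
start = [0.1] * 5

below = iterate(start, clique, beta=0.05, delta=0.5, steps=200)   # s = 0.4
above = iterate(start, clique, beta=0.30, delta=0.5, steps=200)   # s = 2.4
print("below threshold:", max(below))
print("above threshold:", min(above))
```

With these numbers the infection probabilities decay towards zero in the first run and settle at a positive value in the second, which is the qualitative split the stability analysis formalizes.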

We  have  the  following  theorem  about  NLDS  stability  at  a  fixed  point: 

Theorem 2.2 (Asymptotic Stability, e.g., see [HS74]). The system given by $P_{t+1} = g(P_t)$ is
asymptotically stable at an equilibrium point $P = x$, if the eigenvalues of $\mathcal{J} = \nabla g(x)$ are less than
1 in absolute value, where,

$$\mathcal{J}_{k,j} = [\nabla g(x)]_{k,j} = \left. \frac{\partial g_k}{\partial P_j} \right|_{P = x}$$

Hence,  next  we  compute  the  fixed  point  we  are  interested  in  and  the  Jacobian  of  our 
NLDS  at  that  point. 

2.C  Fixed  point 


Figure  2.8:  State  Diagram  for  any  node  in  the  graph  at  the  fixed  point  when  no  node  is  present 
in  a  state  in  the  Infected  class.  Only  cross-class  edges  are  shown.  Note  that  it  is  now  a  simple 
Markov  chain  with  a  unique  steady  state  probability. 

We are interested in the stability of the equilibrium point (i.e. where $P_{t+1} = P_t (= x)$)
of the NLDS which corresponds to when no one is infected. Only the transitions from
the Susceptible class states towards the Infected class states are graph-based (and can
happen only when at least one of the nodes is in any of the Infected states), so the
state-diagram for each node will be a simple Markov chain (call it $MC_{SV}$) consisting of
the Susceptible and Vigilant states (see Figure 2.8). Note that now there are no graph-based
effects, hence each node is independent of the others and will converge to the steady state
probabilities corresponding to the Markov chain. The steady state vector $\pi^*$ (size $q \times 1$,
where q is the number of states in the Susceptible and Vigilant classes), which will be
the same for each node, can be computed from the following equations from standard
Markov chain analysis:

$$\pi^{*T} \, \mathrm{Tran}_{MC_{SV}} = \pi^{*T}, \qquad \sum_{l=1}^{q} \pi^*_l = 1 \tag{2.9}$$



Hence $\pi^*$ is a probability vector and is the left eigenvector corresponding to eigenvalue
1 of the stochastic matrix $\mathrm{Tran}_{MC_{SV}}$ of the Markov chain $MC_{SV}$. The full ($m \times 1$)
probability vector $p^*$ for each node at this steady state will have the entries in $\pi^*$ for
states in the Susceptible and Vigilant classes and 0 for all states in the Infected class. The
fixed point $x$ of the global original NLDS can be finally represented as:

$$x = \big[ \, p^{*T} \;\; p^{*T} \;\; \cdots \;\; p^{*T} \, \big]^T \tag{2.10}$$

where $p^*$ is repeated N times (once for each node in the graph). Let $p_{S_y}$ be the steady
state probability value in the vector $p^*$ corresponding to the $S_y$ state. In other words,
each node will have a probability of $p_{S_y}$ of being present in the $S_y$ state at the fixed point.
Also define

$$p_S = \sum_{y=1}^{w} p_{S_y} \tag{2.11}$$

i.e. $p_S$ is the total probability of each node at the fixed point of being present in any of
the states of the Susceptible class.
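
The steady-state computation of Equation 2.9 can be sketched for a toy chain. The example assumes a hypothetical two-state $MC_{SV}$ with one Susceptible state S and one Vigilant state V, where S moves to V with probability 0.1 and V returns to S with probability 0.3; these numbers, the function name `steady_state`, and the use of repeated left-multiplication (which presumes the chain converges to a unique steady state) are all illustrative choices.

```python
def steady_state(tran, iters=1000):
    """Approximate pi* with pi*^T Tran = pi*^T by repeated left-multiplication.

    `tran` is row-stochastic: tran[k][l] is the probability of moving from
    state k to state l. The entries of the result sum to 1.
    """
    q = len(tran)
    pi = [1.0 / q] * q
    for _ in range(iters):
        pi = [sum(pi[k] * tran[k][l] for k in range(q)) for l in range(q)]
    return pi

# Hypothetical chain: state 0 = S, state 1 = V.
theta, gamma = 0.1, 0.3            # S -> V and V -> S probabilities (assumed)
tran = [[1.0 - theta, theta],
        [gamma, 1.0 - gamma]]

pi_star = steady_state(tran)
p_s = pi_star[0]                   # total Susceptible-class probability (only one S state here)
print(pi_star, p_s)
```

For this two-state chain the balance condition gives $p_S = \gamma / (\gamma + \theta)$, which the iteration reproduces.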


2.D  The  Jacobian 


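We know from Theorem 2.2 that $x$ is stable if the eigenvalues of $\mathcal{J} = \nabla g(x)$ are less than 1
in absolute value. From the definition of $\mathcal{J}$ we can see that it is an $mN \times mN$ matrix
with m (for each state) square blocks of size $N \times N$ each (corresponding to every node in
the graph). We can calculate $\mathcal{J}$ to be (states have been mentioned on the top and side for
ease of exposition and $\mathbf{I}$ is the identity matrix of size $N \times N$):

$$\mathcal{J} = \begin{array}{c|cccc}
 & S_y & K & E & I \\ \hline
S_y & \big(1 - \sum_{K \neq S_y, E} \alpha_{S_y K}\big)\mathbf{I} & \alpha_{K S_y}\mathbf{I} & \alpha_{E S_y}\mathbf{I} - p_{S_y}\beta_1 A & \alpha_{I S_y}\mathbf{I} - p_{S_y}\beta_2 A \\
U & \alpha_{S_y U}\mathbf{I} & \alpha_{K U}\mathbf{I} & \alpha_{E U}\mathbf{I} & \alpha_{I U}\mathbf{I} \\
E & \alpha_{S_y E}\mathbf{I} & \alpha_{K E}\mathbf{I} & \alpha_{E E}\mathbf{I} + p_S\beta_1 A & \alpha_{I E}\mathbf{I} + p_S\beta_2 A \\
I & \alpha_{S_y I}\mathbf{I} & \alpha_{K I}\mathbf{I} & \alpha_{E I}\mathbf{I} & \alpha_{I I}\mathbf{I}
\end{array}$$

where K is any state $\notin \{E, I\}$ and U is any state $\notin \{S_1, S_2, \ldots, S_w, E, I\}$.

Recall the properties we are assuming for the epidemic models discussed in Section 2.5.2
(also see Figure 2.2). Crucially, they imply $\forall K \notin \{E, I\}: \alpha_{KE} = 0$ and $\forall K \notin \{E, I\}: \alpha_{KI} = 0$.
Hence $\mathcal{J}$ reduces to:

$$\mathcal{J} = \begin{array}{c|cccc}
 & S_y & K & E & I \\ \hline
S_y & \big(1 - \sum_{K \neq S_y, E} \alpha_{S_y K}\big)\mathbf{I} & \alpha_{K S_y}\mathbf{I} & \alpha_{E S_y}\mathbf{I} - p_{S_y}\beta_1 A & \alpha_{I S_y}\mathbf{I} - p_{S_y}\beta_2 A \\
U & \alpha_{S_y U}\mathbf{I} & \alpha_{K U}\mathbf{I} & \alpha_{E U}\mathbf{I} & \alpha_{I U}\mathbf{I} \\
E & 0_{N,N} & 0_{N,N} & \alpha_{E E}\mathbf{I} + p_S\beta_1 A & \alpha_{I E}\mathbf{I} + p_S\beta_2 A \\
I & 0_{N,N} & 0_{N,N} & \alpha_{E I}\mathbf{I} & \alpha_{I I}\mathbf{I}
\end{array}$$

where $0_{N,N}$ is an $N \times N$ matrix with all zeros.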


2.E  Eigenvalues  of  the  Jacobian 

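Note that $\mathcal{J}$ is very structured and can be written as:

$$\mathcal{J} = \begin{bmatrix} B_1 & B_2 \\ 0_{2N,(m-2)N} & B_3 \end{bmatrix} \tag{2.12}$$

where $B_1$, $B_2$ and $B_3$ are matrices of size $(m-2)N \times (m-2)N$, $(m-2)N \times 2N$ and
$2N \times 2N$ respectively. $B_3$ corresponds to the E and I rows and columns of $\mathcal{J}$, i.e.:

$$B_3 = \begin{bmatrix} \alpha_{EE}\mathbf{I} + p_S\beta_1 A & \alpha_{IE}\mathbf{I} + p_S\beta_2 A \\ \alpha_{EI}\mathbf{I} & \alpha_{II}\mathbf{I} \end{bmatrix} \tag{2.13}$$

$B_1$ and $B_2$ are defined similarly. Consider any eigenvector $v$ (size $mN \times 1$) and corresponding
eigenvalue $\lambda_{\mathcal{J}}$ of $\mathcal{J}$. We can write $v$ as being composed of a vector $v_1$ of size
$(m-2)N \times 1$ and a vector $v_2$ of size $2N \times 1$, i.e.:

$$v = \begin{bmatrix} v_1 \\ v_2 \end{bmatrix} \tag{2.14}$$

Also $v$ and $\lambda_{\mathcal{J}}$ satisfy the eigenvalue equation:

$$\mathcal{J} v = \lambda_{\mathcal{J}} v \tag{2.15}$$

Substituting from Equations 2.12 and 2.14 we get:

$$\begin{bmatrix} B_1 & B_2 \\ 0 & B_3 \end{bmatrix} \begin{bmatrix} v_1 \\ v_2 \end{bmatrix} = \lambda_{\mathcal{J}} \begin{bmatrix} v_1 \\ v_2 \end{bmatrix} \tag{2.16}$$

Equation 2.16 implies the following two relations:

$$B_1 v_1 + B_2 v_2 = \lambda_{\mathcal{J}} v_1 \tag{2.17}$$

$$B_3 v_2 = \lambda_{\mathcal{J}} v_2 \tag{2.18}$$

From Equation 2.18 we can infer that precisely one of the following holds:

1. $v_2 = 0$

2. $v_2$ is an eigenvector of $B_3$ (and consequently $\lambda_{\mathcal{J}}$ is the matching eigenvalue of $B_3$)

If $v_2 = 0$, Equation 2.17 reduces to

$$B_1 v_1 = \lambda_{\mathcal{J}} v_1$$

wherein again, either $v_1 = 0$ or $\lambda_{\mathcal{J}}$ is an eigenvalue of $B_1$. The condition $v_1 = 0$ is not
meaningful as then $v = 0$ ($v$ is an eigenvector of $\mathcal{J}$ implies $v$ is non-zero). Therefore the
eigenvalues of $\mathcal{J}$ are given by the eigenvalues of $B_1$ (with $v_2 = 0$) and the eigenvalues of
$B_3$.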


2.E.1 Eigenvalues of $B_1$

From the expression for $\mathcal{J}$ derived in Section 2.D, note that:

$$B_1 = T \otimes \mathbf{I} \tag{2.19}$$

where $\otimes$ is the Kronecker product of two matrices and

$$T = \begin{bmatrix} \big(1 - \sum_{K \neq S_y, E} \alpha_{S_y K}\big) & \alpha_{K S_y} & \cdots \\ \vdots & \vdots & \vdots \\ \alpha_{S_y U} & \alpha_{K U} & \cdots \end{bmatrix} \tag{2.20}$$

We know from matrix algebra [HJ91] that if $C = D \otimes E$ then $C_\Lambda = D_\Lambda \otimes E_\Lambda$, where
$C_\Lambda$ denotes a diagonal matrix with the eigenvalues of the matrix C on the diagonal. But
$\mathbf{I}_\Lambda = \mathbf{I}$, hence the eigenvalues of $B_1$ are the same as the eigenvalues of T (although with
repetition). In other words, eigenvalues of T are eigenvalues of $\mathcal{J}$ as well.
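
2.E.2 Eigenvalues of $B_3$

Let $u = \begin{bmatrix} u_1 \\ u_2 \end{bmatrix}$ be a corresponding eigenvector of $B_3$ ($u_1$ and $u_2$ are of size $N \times 1$ each and,
as the eigenvalues of $B_3$ are also eigenvalues of $\mathcal{J}$, we use $\lambda_{\mathcal{J}}$ for an eigenvalue of $B_3$).
Hence, the standard eigenvalue relation $B_3 u = \lambda_{\mathcal{J}} u$ requires the following equations to
be satisfied:

$$(\alpha_{EE}\mathbf{I} + p_S\beta_1 A) u_1 + (\alpha_{IE}\mathbf{I} + p_S\beta_2 A) u_2 = \lambda_{\mathcal{J}} u_1 \tag{2.21}$$

$$\alpha_{EI} u_1 + \alpha_{II} u_2 = \lambda_{\mathcal{J}} u_2 \tag{2.22}$$

Using Equation 2.22, we can compute $u_1$ in terms of $u_2$ as:

$$u_1 = \left( \frac{\lambda_{\mathcal{J}} - \alpha_{II}}{\alpha_{EI}} \right) u_2 \tag{2.23}$$

Substituting it back into Equation 2.21 we get:

$$(\alpha_{EE}\mathbf{I} + p_S\beta_1 A) \left( \frac{\lambda_{\mathcal{J}} - \alpha_{II}}{\alpha_{EI}} \right) u_2 + (\alpha_{IE}\mathbf{I} + p_S\beta_2 A) u_2 = \lambda_{\mathcal{J}} \left( \frac{\lambda_{\mathcal{J}} - \alpha_{II}}{\alpha_{EI}} \right) u_2$$

$$\Rightarrow \Big( \alpha_{EE}(\lambda_{\mathcal{J}} - \alpha_{II})\mathbf{I} + \alpha_{IE}\alpha_{EI}\mathbf{I} + \big( p_S\beta_1(\lambda_{\mathcal{J}} - \alpha_{II}) + p_S\beta_2\alpha_{EI} \big) A \Big) u_2 = \lambda_{\mathcal{J}}(\lambda_{\mathcal{J}} - \alpha_{II}) u_2$$

which finally gives

$$A u_2 = \left( \frac{\lambda_{\mathcal{J}}^2 - (\alpha_{II} + \alpha_{EE})\lambda_{\mathcal{J}} + \alpha_{II}\alpha_{EE} - \alpha_{IE}\alpha_{EI}}{p_S\beta_1(\lambda_{\mathcal{J}} - \alpha_{II}) + p_S\beta_2\alpha_{EI}} \right) u_2 \tag{2.24}$$

Again, Equation 2.24 tells us that either $u_2 = 0$ or it is an eigenvector of A. But
$u_2 = 0 \Rightarrow u_1 = 0 \Rightarrow u = 0$, which is not possible. Thus Equation 2.24 is an eigenvalue
equation for the adjacency matrix A and we are looking for solutions $\lambda_{\mathcal{J}}$ and $u_2$ such that
they satisfy it. Hence,

$$\lambda_A = \frac{\lambda_{\mathcal{J}}^2 - (\alpha_{II} + \alpha_{EE})\lambda_{\mathcal{J}} + \alpha_{II}\alpha_{EE} - \alpha_{IE}\alpha_{EI}}{p_S\beta_1(\lambda_{\mathcal{J}} - \alpha_{II}) + p_S\beta_2\alpha_{EI}}$$

where $\lambda_A$ is an eigenvalue of A. This finally gives

$$\lambda_{\mathcal{J}}^2 - \lambda_{\mathcal{J}}(\alpha_{EE} + \alpha_{II} + p_S\beta_1\lambda_A) + \big( \alpha_{II}\alpha_{EE} - \alpha_{IE}\alpha_{EI} + p_S\lambda_A(\beta_1\alpha_{II} - \beta_2\alpha_{EI}) \big) = 0 \tag{2.25}$$

Thus we have a different quadratic equation (Q.E.) for each eigenvalue $\lambda_A$ of A. Each
Q.E. gives us two eigenvalues (possibly repeated) of $\mathcal{J}$.
So, finally, we can conclude the following lemma:

Lemma 2.1 (Eigenvalues of $\mathcal{J}$). Eigenvalues of $\mathcal{J}$ are given by the eigenvalues of T (Equations
2.19 and 2.20) and the roots of the Q.Es given by Equation 2.25 for each eigenvalue $\lambda_A$ of
A.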
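
The quadratic of Equation 2.25 is easy to evaluate numerically. The sketch below solves it for a given eigenvalue of A and returns the two corresponding eigenvalues of the Jacobian. The parameter set is hypothetical (an SEIR-like setting in which E is not infectious, E moves to I with probability 0.3, and I is removed with probability 0.4, with all susceptible mass in one state), and the function name `jacobian_roots` is illustrative.

```python
import cmath

def jacobian_roots(lam_a, p_s, beta1, beta2, a_ee, a_ei, a_ie, a_ii):
    """Roots of Eq. 2.25 for one eigenvalue lam_a of A (leading coefficient is 1)."""
    b = -(a_ee + a_ii + p_s * beta1 * lam_a)
    c = a_ii * a_ee - a_ie * a_ei + p_s * lam_a * (beta1 * a_ii - beta2 * a_ei)
    root_disc = cmath.sqrt(b * b - 4.0 * c)   # complex sqrt covers the negative-discriminant case
    return ((-b + root_disc) / 2.0, (-b - root_disc) / 2.0)

# Hypothetical SEIR-like parameters (assumed for illustration only).
model = dict(p_s=1.0, beta1=0.0, beta2=0.1, a_ee=0.7, a_ei=0.3, a_ie=0.0, a_ii=0.6)

for lam_a in (2.0, 8.0):
    roots = jacobian_roots(lam_a, **model)
    print(lam_a, [round(abs(r), 4) for r in roots])
```

For this parameter set the larger root stays inside the unit circle for the smaller eigenvalue of A and leaves it for the larger one, which is the behavior the stability section turns into a closed-form condition.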

2.F Stability

We require all the eigenvalues of $\mathcal{J}$ to be less than 1 in absolute value (according to
Theorem 2.2). From Lemma 2.1, we have two cases to handle in enforcing this:

(C1) All the eigenvalues of T should be less than 1 in absolute value
(C2) All the roots of the Q.Es given by Equation 2.25 for each eigenvalue $\lambda_A$ of A should
be less than 1 in absolute value

2.F.1 Case C1

Note that this case depends only on the model as the matrix T is independent of the
adjacency matrix A. But T is a stochastic matrix, i.e. all the column sums are equal to 1;
consequently all its eigenvalues are less than 1 in absolute value.

Lemma 2.2 (Stability C1). All eigenvalues of the matrix T (given by Equation 2.20) are less
than 1 in absolute value.

2.F.2  Case  C2 

As C1 is always true, we only need to ensure case C2. We can prove here the following:

Lemma 2.3 (Stability C2). All the roots of the Q.Es given by Equation 2.25 for each eigenvalue
$\lambda_A$ of A are less than 1 in absolute value if:

$$\lambda_1 p_S \left( \frac{\beta_1(1 - \alpha_{II}) + \beta_2\alpha_{EI}}{(1 - \alpha_{II})(1 - \alpha_{EE}) - \alpha_{IE}\alpha_{EI}} \right) < 1$$

Proof. Let $r_1$ and $r_2$ be the roots of Equation 2.25 ($r_1$ and $r_2$ can be real or complex
depending on $\lambda_A$). Then we want

$$|r_1| < 1 \text{ and } |r_2| < 1$$

$r_1$ and $r_2$ are real. As the roots are real, $\lambda_A$ is such that the discriminant $\mathcal{D}$ of the
quadratic equation is greater than zero. In this situation:

$$|r_1| < 1 \text{ and } |r_2| < 1 \;\Rightarrow\; r_1 \in (-1, 1) \text{ and } r_2 \in (-1, 1) \tag{2.26}$$

From the theory of quadratic equations, it is well known (see e.g., [Mil63]) that for
real roots $x_1$ and $x_2$ of a Q.E. $f(x) = ax^2 + bx + c$ (with $a > 0$) to lie in the interval $(-1, 1)$
the following conditions must be true:

$$a - c > 0,$$
$$a - b + c > 0,$$
$$a + b + c > 0.$$

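
Intuitively, the first condition forces the product of the roots to be less than 1 while the
last two conditions state that the value of f(x) at -1 and 1 should not be "too small". In our
case (with a, b and c read off from Equation 2.25, so a = 1), these then translate into:

$$\alpha_{II}\alpha_{EE} - \alpha_{IE}\alpha_{EI} + p_S\lambda_A(\beta_1\alpha_{II} - \beta_2\alpha_{EI}) < 1 \tag{2.27a}$$

$$1 + (\alpha_{EE} + \alpha_{II} + p_S\beta_1\lambda_A) + \alpha_{II}\alpha_{EE} - \alpha_{IE}\alpha_{EI} + p_S\lambda_A(\beta_1\alpha_{II} - \beta_2\alpha_{EI}) > 0 \tag{2.27b}$$

$$1 - (\alpha_{EE} + \alpha_{II} + p_S\beta_1\lambda_A) + \alpha_{II}\alpha_{EE} - \alpha_{IE}\alpha_{EI} + p_S\lambda_A(\beta_1\alpha_{II} - \beta_2\alpha_{EI}) > 0 \tag{2.27c}$$

Equations 2.27b and 2.27c can be written as:

$$\lambda_A p_S \left( \frac{-\beta_1(1 + \alpha_{II}) + \beta_2\alpha_{EI}}{(1 + \alpha_{II})(1 + \alpha_{EE}) - \alpha_{IE}\alpha_{EI}} \right) < 1 \tag{2.28a}$$

$$\lambda_A p_S \left( \frac{\beta_1(1 - \alpha_{II}) + \beta_2\alpha_{EI}}{(1 - \alpha_{II})(1 - \alpha_{EE}) - \alpha_{IE}\alpha_{EI}} \right) < 1 \tag{2.28b}$$

respectively. The above equations should be true for any eigenvalue $\lambda_A$ of A which
makes $\mathcal{D} > 0$. Recall that we are considering only undirected graphs, hence A is a
symmetric binary (0/1) square irreducible matrix. As a result, firstly, all its eigenvalues
are real. Secondly, from the Perron-Frobenius theorem [McC00] the algebraically largest
eigenvalue $\lambda_1$ of A is a positive real number and also has the largest magnitude among
all eigenvalues. Hence if the above equations are true for $\lambda_A = \lambda_1$ we are done. Now
note that

$$(1 + \alpha_{II})(1 + \alpha_{EE}) - \alpha_{IE}\alpha_{EI} > (1 - \alpha_{II})(1 - \alpha_{EE}) - \alpha_{IE}\alpha_{EI}$$

and that

$$\beta_1(1 - \alpha_{II}) + \beta_2\alpha_{EI} > -\beta_1(1 + \alpha_{II}) + \beta_2\alpha_{EI}$$

In addition the L.H.S. in both equations is positive under $\lambda_A = \lambda_1$. So Equation 2.28a is
always true if Equation 2.28b holds (under $\lambda_A = \lambda_1$), i.e.

$$\lambda_1 p_S \left( \frac{\beta_1(1 - \alpha_{II}) + \beta_2\alpha_{EI}}{(1 - \alpha_{II})(1 - \alpha_{EE}) - \alpha_{IE}\alpha_{EI}} \right) < 1 \tag{2.29}$$
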

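As $\lambda_1$ is the largest eigenvalue both algebraically and in magnitude, under Equation 2.29,

$$1 - \alpha_{II}\alpha_{EE} + \alpha_{IE}\alpha_{EI} - p_S\lambda_A(\beta_1\alpha_{II} - \beta_2\alpha_{EI})$$

$$> 1 - \alpha_{II}\alpha_{EE} + \alpha_{IE}\alpha_{EI} - \left( \frac{(1 - \alpha_{II})(1 - \alpha_{EE}) - \alpha_{IE}\alpha_{EI}}{\beta_1(1 - \alpha_{II}) + \beta_2\alpha_{EI}} \right) (\beta_1\alpha_{II} - \beta_2\alpha_{EI})$$

$$= \frac{\big( (1 - \alpha_{II})^2 + \alpha_{IE}\alpha_{EI} \big)\beta_1 + \alpha_{EI}(2 - \alpha_{II} - \alpha_{EE})\beta_2}{(1 - \alpha_{II})\beta_1 + \alpha_{EI}\beta_2}$$

$$> 0$$

Therefore Equation 2.27a is also true if Equation 2.29 holds. Thus the condition for the roots to be
in $(-1, 1)$ when they are real is given simply by Equation 2.29.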

$r_1$ and $r_2$ are complex. In this case $\lambda_A$ is such that $\mathcal{D} < 0$. Also, as Equation 2.25 has real
coefficients, $r_1$ and $r_2$ are complex conjugates of each other and so $|r_1| = |r_2| = \sqrt{r_1 \cdot r_2}$.
But the product of the roots $x_1$ and $x_2$ of the equation $ax^2 + bx + c = 0$ is equal to $c/a$. Hence
we want to enforce $c/a < 1$. In our case it is

$$\alpha_{II}\alpha_{EE} - \alpha_{IE}\alpha_{EI} + p_S\lambda_A(\beta_1\alpha_{II} - \beta_2\alpha_{EI}) < 1$$

which is exactly Equation 2.27a. From the above analysis, we already know that it is
true if Equation 2.29 holds. So, for any eigenvalue $\lambda_A$ for which $\mathcal{D} < 0$, the roots have
magnitude less than 1 given Equation 2.29 is true.

Thus in both cases, whether the roots are real or complex, Equation 2.29 is a sufficient
condition for the roots to have magnitude less than 1. □

To re-cap, we state our result and then give its proof:

Theorem 2.3 (G2 theorem). For virus propagation models which satisfy our general initial
assumptions and for any arbitrary undirected graph with adjacency matrix A and largest eigenvalue
$\lambda_1$, the sufficient condition for stability is given by:

$$s < 1$$

where s (the effective strength) is:

$$s = \lambda_1 \cdot C_{VPM}$$

and $C_{VPM}$ is a constant dependent on the model (given by Equation 2.30). Hence, the tipping point is
reached when s = 1.

Proof. Lemma 2.2 and Lemma 2.3 ensure cases C1 and C2 and hence together with
Lemma 2.1 imply that the eigenvalues of the Jacobian $\mathcal{J}$ of our general NLDS computed
at the fixed point $x$ are less than 1 in magnitude if Equation 2.29 is true.

Therefore, using Theorem 2.2, our general NLDS is stable at its fixed point $x$ if Equation 2.29
holds. Recall that $x$ is the point when there are no infected nodes in the system (Appendix 2.C)
and that this is the fixed point whose stability conditions determine the epidemic threshold
(Section 2.5).

Finally, we can conclude the theorem with

$$C_{VPM} = p_S \left( \frac{\beta_1(1 - \alpha_{II}) + \beta_2\alpha_{EI}}{(1 - \alpha_{II})(1 - \alpha_{EE}) - \alpha_{IE}\alpha_{EI}} \right) \tag{2.30}$$

and the effective strength $s = \lambda_1 \cdot C_{VPM}$. The parameter $C_{VPM}$ is a constant for a given
propagation model while the only parameter involved from the underlying contact-network
is $\lambda_1$, the first eigenvalue of the adjacency matrix. □

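
Equation 2.30 translates directly into a small helper. The sketch below computes the model constant and the effective strength for two hypothetical parameter sets: an SIS-style mapping (assumed here: E is the only infected state in use, so the I-state parameters are set to zero and the expression collapses to $\beta/\delta$), and an SEIR-like setting with a non-infectious E state. The function names and all numeric values are illustrative assumptions.

```python
def c_vpm(p_s, beta1, beta2, a_ee, a_ei, a_ie, a_ii):
    """Model-dependent constant of Eq. 2.30."""
    numerator = beta1 * (1.0 - a_ii) + beta2 * a_ei
    denominator = (1.0 - a_ii) * (1.0 - a_ee) - a_ie * a_ei
    return p_s * numerator / denominator

def effective_strength(lambda_1, **model):
    """s = lambda_1 * C_VPM; the tipping point is s = 1."""
    return lambda_1 * c_vpm(**model)

# SIS-style mapping (assumption): beta = 0.1, delta = 0.5, state I unused.
sis = dict(p_s=1.0, beta1=0.1, beta2=0.0, a_ee=1.0 - 0.5, a_ei=0.0, a_ie=0.0, a_ii=0.0)

# SEIR-like mapping (assumption): E not infectious, E -> I w.p. 0.3, I removed w.p. 0.4.
seir_like = dict(p_s=1.0, beta1=0.0, beta2=0.1, a_ee=0.7, a_ei=0.3, a_ie=0.0, a_ii=0.6)

for name, model in [("SIS", sis), ("SEIR-like", seir_like)]:
    s = effective_strength(4.0, **model)       # lambda_1 = 4, e.g. a 5-clique
    print(f"{name}: C_VPM = {c_vpm(**model):.3f}, s = {s:.3f}")
```

The graph enters only through `lambda_1` and the model only through the constant, mirroring the de-coupling stated in the theorem.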

Chapter 3

Epidemic Thresholds: Time-varying Graphs


In this chapter, we focus on contact networks that change over time (say, day vs night
connectivity), and the SIS (susceptible/infected/susceptible, flu-like) virus propagation
model. Given such a configuration, we want to ask the same question as in the previous
chapter: what can we say about the epidemic threshold? That is, can we determine when a
small infection will "take-off" and create an epidemic? This is a very real problem, since,
for example, people have different connections during the day at work, and during the
night at home. As we saw before, static graphs have been studied for a long time, with
numerous analytical results. Time-evolving networks are so hard to analyze that most
existing works are simulation studies [BBE+08].

Specifically,  our  contributions  in  this  chapter  are:  (a)  we  formulate  the  problem  by 
approximating  it  by  a  Non-linear  Dynamical  system  (NLDS),  and  (b)  we  derive  the  first 
closed  formula  for  the  epidemic  threshold  of  time-varying  graphs  under  the  SIS  model. 


3.1  Introduction 

The goal of this work is to analytically study the epidemic spread on time-varying graphs.
We focus on time-varying graphs that follow an alternating connectivity behavior, which
is motivated by the day-night pattern of human behavior. Note that our analysis is
not restricted to two graphs: we can have an arbitrary number of alternating graphs.
Furthermore, we focus on the SIS model [Het00], which resembles a flu-like virus, where
healthy nodes get the virus stochastically from their infected neighbors, and infected
nodes get cured with some probability and become susceptible again. The SIS model
can also be used to model many other types of dynamical processes, for
example, product penetration in marketing [Rog03].

More specifically, the inputs to our problem are: (a) a set of T alternating graphs, and
(b) the infectivity of the virus and the recovery rate ($\beta$, $\delta$ for the SIS model). We want to
answer the following question (rigorously defined in Section 3.3):

Q1. Can we say whether a small infection can "take-off" and create an epidemic under
the SIS model (i.e. determine the so-called epidemic threshold)?

While epidemic spreading on static graphs has been studied extensively (e.g., see [Het00,
AM91, PSV02, CWW+08]), virus propagation on time-varying graphs has received little
attention. Moreover, most previous studies on time-varying graphs use only simulations
[BBE+08]. We review the previous efforts in more detail in Section 3.2.

We are arguably the first to study virus propagation analytically on arbitrary,
time-varying graphs. In more detail, the contributions of our work can be summarized in
the following points:

1.  We  formulate  the  problem,  and  show  that  it  can  be  approximated  with  a  Non-Linear 
Dynamical  System  (NLDS). 

2.  We  give  the  first  closed-formula  for  the  epidemic  threshold,  involving  the  first 
eigenvalue  of  the  so-called  system-matrix  (see  Theorem  3.1).  The  system-matrix 
combines  the  connectivity  information  (the  alternating  adjacency  matrices)  and  the 
characteristics  of  the  virus  (infectivity  and  recovery  rate). 

The rest of the chapter is organized as follows: we review related work in Section 3.2,
give the formal problem definitions in Section 3.3, and derive the threshold in Section 3.4.
We discuss salient points of the result in Section 3.5 and illustrate it with experiments in
Section 3.6. We discuss the generality of the result in Section 3.7 and conclude in Section 3.8.


3.2  Related  Work 

We have already reviewed the related work concerning epidemic thresholds in the previous
chapter. To summarize, none of the earlier related work focuses on epidemic thresholds
for arbitrary, real graphs, with the only exceptions being [WCWF03, CWW+08] and the
follow-up paper by Ganesh et al. [GMT05]. However, even these works [WCWF03, CWW+08,
GMT05] assume that the underlying graph is fixed, which is unrealistic in many applications.
Hence, to the best of our knowledge, including comprehensive epidemiological
texts [AM91, Bai75] and well-cited surveys [Het00], we are the first to analytically study
virus propagation on arbitrary, real and time-varying graphs.


3.3  Problem  Definitions 

Table 3.1 lists the main symbols used in this work. Following standard notation, we use
capital bold letters for matrices (e.g. $\mathbf{A}$), lower-case bold letters for vectors (e.g. $\mathbf{a}$), and
calligraphic fonts for sets (e.g. $\mathcal{S}$), and we denote the transpose with a prime (i.e. $\mathbf{A}'$ is
the transpose of $\mathbf{A}$). In this chapter, we focus on undirected, unweighted graphs, which
we represent by the adjacency matrix, and only on the SIS virus propagation model. The
SIS model is the susceptible/infected/susceptible virus model where $\beta$ is the probability
that an infected node will transmit the infection over a link connected to a neighbor and
$\delta$ is the probability with which an infected node cures itself and becomes susceptible
again. Please see [Het00] for a detailed discussion on SIS and other virus models.

Table 3.1: Symbols

Symbol                                                  Definition and Description
$\mathbf{A}, \mathbf{B}, \ldots$                        matrices (bold upper case)
$\mathbf{A}(i,j)$                                       element at the $i$th row and $j$th column of $\mathbf{A}$
$\mathbf{A}(i,:)$                                       $i$th row of matrix $\mathbf{A}$
$\mathbf{A}(:,j)$                                       $j$th column of matrix $\mathbf{A}$
$\mathbf{I}$                                            standard $n \times n$ identity matrix
$\mathbf{a}, \mathbf{b}, \ldots$                        column vectors
$\mathcal{A}, \mathcal{B}, \ldots$                      sets (calligraphic)
$n$                                                     number of nodes in the graphs
$T$                                                     number of different alternating behaviors
$\mathbf{A}_1, \mathbf{A}_2, \ldots, \mathbf{A}_T$      $T$ corresponding size $n \times n$ symmetric alternating adjacency matrices
$\beta$                                                 virus transmission probability in the SIS model
$\delta$                                                virus death probability in the SIS model
$\lambda_{\mathbf{M}}$                                  first eigenvalue (in absolute value) of a matrix $\mathbf{M}$
$\mathbf{u}_{\mathbf{M}}$                               corresponding first eigenvector (for $\lambda_{\mathbf{M}}$) of a matrix $\mathbf{M}$
$p_{i,t}$                                               probability that node $i$ is infected at time $t$
$\mathbf{p}_t$                                          $\mathbf{p}_t = (p_{1,t}, p_{2,t}, \ldots, p_{n,t})'$
$\mathbf{p}_{2t+1}$                                     probability of infection vector for odd days
$\mathbf{p}_{2t}$                                       probability of infection vector for even days
$\eta_t$                                                the expected number of infected nodes at time $t$

Consider a setting with clearly different behaviors, say day/night, each characterized
by a corresponding adjacency matrix ($\mathbf{A}_1$ for day, $\mathbf{A}_2$ for night). What is the epidemic
threshold under an SIS virus model? More generally, the problem we are tackling can be
formally stated as follows:

Problem 3.1. Epidemic Threshold

Given: (1) T alternating behaviors, characterized by a set of T graphs $\mathcal{A} = \{\mathbf{A}_1, \mathbf{A}_2, \ldots, \mathbf{A}_T\}$; and
(2) the SIS model [CWW+08] with virus parameters $\beta$ and $\delta$;

Find: A condition under which the infection will die out exponentially quickly (regardless of
initial condition).

3.4 Epidemic Threshold on Time-varying Graphs

To simplify discussion, we consider T = 2 in Problem 3.1, with $\mathcal{A}$ consisting of only two
graphs: $G_1$ with the adjacency matrix $\mathbf{A}_1$ for the odd time-stamps (the 'days') and $G_2$
with the adjacency matrix $\mathbf{A}_2$ for the even time-stamps (the 'nights'). Our proofs and
results can be naturally extended to handle any arbitrary sequence of T graphs.

3.4.1  The  NLDS 

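We first propose to approximate the infection dynamics by a Non-Linear Dynamical
System (NLDS) representing the evolution of the probability of infection vector ($\mathbf{p}_t$) over
time. We can compute the probability $\zeta_t(i)$ that node $i$ does not receive any infections at
time $t$. A node $i$ won't receive an infection from a given neighbor if either that neighbor is
not infected, or it is infected but fails to transmit the infection, which happens with
probability $1 - \beta$. Assuming that the neighbors are independent, we get:

$$\begin{aligned}
\zeta_{2t+1}(i) &= \prod_{j \in \mathcal{NE}_1(i)} \left( p_{j,2t+1}(1 - \beta) + (1 - p_{j,2t+1}) \right) \\
&= \prod_{j=1}^{n} \left( 1 - \beta \mathbf{A}_1(i,j)\, p_{j,2t+1} \right)
\end{aligned} \tag{3.1}$$

where $\mathcal{NE}_1(i)$ is the set of neighbors of node $i$ in the graph $G_1$ with adjacency matrix $\mathbf{A}_1$.
Similarly, we can write $\zeta_{2t}(i)$ as:

$$\begin{aligned}
\zeta_{2t}(i) &= \prod_{j \in \mathcal{NE}_2(i)} \left( p_{j,2t}(1 - \beta) + (1 - p_{j,2t}) \right) \\
&= \prod_{j=1}^{n} \left( 1 - \beta \mathbf{A}_2(i,j)\, p_{j,2t} \right)
\end{aligned} \tag{3.2}$$

So, $p_{i,2t+1}$ and $p_{i,2t+2}$ are:

$$\begin{aligned}
1 - p_{i,2t+1} &= \delta p_{i,2t} + (1 - p_{i,2t})\, \zeta_{2t}(i) \\
\Rightarrow\; p_{i,2t+1} &= 1 - \delta p_{i,2t} - (1 - p_{i,2t})\, \zeta_{2t}(i)
\end{aligned} \tag{3.3}$$

and

$$\begin{aligned}
1 - p_{i,2t+2} &= \delta p_{i,2t+1} + (1 - p_{i,2t+1})\, \zeta_{2t+1}(i) \\
\Rightarrow\; p_{i,2t+2} &= 1 - \delta p_{i,2t+1} - (1 - p_{i,2t+1})\, \zeta_{2t+1}(i)
\end{aligned} \tag{3.4}$$

Note that we can write our NLDS as:

$$\mathbf{p}_{2t+1} = g_2(\mathbf{p}_{2t}) \tag{3.5}$$

$$\mathbf{p}_{2t+2} = g_1(\mathbf{p}_{2t+1}) \tag{3.6}$$

where $g_2$ and $g_1$ are the corresponding non-linear functions defined by Equations 3.3 and
3.4 respectively ($g_1$ depends only on $\mathbf{A}_1$ and $g_2$ on $\mathbf{A}_2$).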
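
The recursion defined by Equations 3.1 to 3.6 can be iterated numerically. The following is a minimal sketch in Python with NumPy, not the code used for this thesis; the names `nlds_step` and `run_nlds` are hypothetical and introduced only for illustration.

```python
import numpy as np

def nlds_step(p, A, beta, delta):
    """One NLDS step (Equations 3.3 / 3.4).

    p : current infection probabilities, shape (n,)
    A : symmetric 0/1 adjacency matrix active at the current time-stamp
    """
    # zeta[i] = prod_j (1 - beta * A[i, j] * p[j])   (Equations 3.1 / 3.2)
    zeta = np.prod(1.0 - beta * A * p[np.newaxis, :], axis=1)
    return 1.0 - delta * p - (1.0 - p) * zeta

def run_nlds(p0, graphs, beta, delta, steps):
    """Iterate the NLDS; graphs[t % T] is the graph active at time-stamp t.

    Returns an array of shape (steps + 1, n) holding p_0, p_1, ..., p_steps.
    """
    p = np.asarray(p0, dtype=float)
    history = [p.copy()]
    for t in range(steps):
        p = nlds_step(p, graphs[t % len(graphs)], beta, delta)
        history.append(p.copy())
    return np.array(history)
```

With the convention of this section (even time-stamps use $\mathbf{A}_2$, odd ones $\mathbf{A}_1$), one would call `run_nlds(p0, [A2, A1], beta, delta, steps)`. The all-zeros vector is a fixed point of this map, which is the equilibrium analyzed next.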

We have the following theorem about the asymptotic stability of an NLDS at a fixed
point:

Theorem 3.1. (Asymptotic Stability, e.g. see [HS74]) The system given by $\mathbf{p}_{t+1} = g(\mathbf{p}_t)$ is
asymptotically stable at an equilibrium point $\mathbf{p}^*$, if the eigenvalues of the Jacobian $\mathbf{J} = \nabla g(\mathbf{p}^*)$
are less than 1 in absolute value, where

$$\mathbf{J}_{k,l} = [\nabla g(\mathbf{p}^*)]_{k,l} = \left. \frac{\partial p_{k,t+1}}{\partial p_{l,t}} \right|_{\mathbf{p}_t = \mathbf{p}^*}$$

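The fixed point of our interest is the $\mathbf{0}$ vector, which is the state when all nodes are
susceptible and not infected. We then want to analyze the stability of our NLDS at
$\mathbf{p}_{2t} = \mathbf{p}_{2t+1} = \mathbf{0}$. From Equations 3.5 and 3.6, we get:

$$\left. \frac{\partial \mathbf{p}_{2t+2}}{\partial \mathbf{p}_{2t+1}} \right|_{\mathbf{p}_{2t+1} = \mathbf{0}} = (1 - \delta)\mathbf{I} + \beta \mathbf{A}_1 = \mathbf{S}_1 \tag{3.7}$$

$$\left. \frac{\partial \mathbf{p}_{2t+1}}{\partial \mathbf{p}_{2t}} \right|_{\mathbf{p}_{2t} = \mathbf{0}} = (1 - \delta)\mathbf{I} + \beta \mathbf{A}_2 = \mathbf{S}_2 \tag{3.8}$$

Any eigenvalue $\lambda^i_{\mathbf{S}_1}$ of $\mathbf{S}_1$ and $\lambda^i_{\mathbf{S}_2}$ of $\mathbf{S}_2$ ($i = 1, 2, \ldots$) is related to the corresponding
eigenvalue $\lambda^i_{\mathbf{A}_1}$ of $\mathbf{A}_1$ and $\lambda^i_{\mathbf{A}_2}$ of $\mathbf{A}_2$ as:

$$\lambda^i_{\mathbf{S}_1} = (1 - \delta) + \beta \lambda^i_{\mathbf{A}_1} \tag{3.9}$$

$$\lambda^i_{\mathbf{S}_2} = (1 - \delta) + \beta \lambda^i_{\mathbf{A}_2} \tag{3.10}$$

Recall that as $\mathbf{A}_1$ and $\mathbf{A}_2$ are symmetric real matrices (the graphs are undirected),
from the Perron-Frobenius theorem [McC00], $\lambda_{\mathbf{A}_1}$ and $\lambda_{\mathbf{A}_2}$ are real and positive. So, from
Equations 3.9 and 3.10, $\lambda_{\mathbf{S}_1}$ and $\lambda_{\mathbf{S}_2}$ are also real and positive.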


3.4.2  The  Threshold 

We  are  now  in  a  position  to  derive  the  epidemic  threshold.  First,  we  have  the  following 
lemma: 

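Lemma 3.1. If $\lambda_{\mathbf{S}} < 1$, then $\mathbf{p}_{2t}$ dies out exponentially quickly; and $\mathbf{0}$ is asymptotically stable
for $\mathbf{p}_{2t}$, where $\mathbf{S}_1 = (1 - \delta)\mathbf{I} + \beta\mathbf{A}_1$, $\mathbf{S}_2 = (1 - \delta)\mathbf{I} + \beta\mathbf{A}_2$ and $\mathbf{S} = \mathbf{S}_1 \times \mathbf{S}_2$.

Proof. Since $\mathbf{p}_{2t+2} = g_1(g_2(\mathbf{p}_{2t}))$ (from Equations 3.5 and 3.6), we have

$$\begin{aligned}
\left. \frac{\partial \mathbf{p}_{2t+2}}{\partial \mathbf{p}_{2t}} \right|_{\mathbf{p}_{2t} = \mathbf{0}}
&= \left. \left( \frac{\partial \mathbf{p}_{2t+2}}{\partial \mathbf{p}_{2t+1}} \times \frac{\partial \mathbf{p}_{2t+1}}{\partial \mathbf{p}_{2t}} \right) \right|_{\mathbf{p}_{2t} = \mathbf{0}} \\
&= \left( \left. \frac{\partial \mathbf{p}_{2t+2}}{\partial \mathbf{p}_{2t+1}} \right|_{\mathbf{p}_{2t+1} = \mathbf{0}} \right) \times \left( \left. \frac{\partial \mathbf{p}_{2t+1}}{\partial \mathbf{p}_{2t}} \right|_{\mathbf{p}_{2t} = \mathbf{0}} \right) \\
&= \mathbf{S}_1 \mathbf{S}_2 = \mathbf{S}
\end{aligned} \tag{3.11}$$

The first equation is due to the chain rule; the second is because $\mathbf{p}_{2t} = \mathbf{0}$ implies
$\mathbf{p}_{2t+1} = \mathbf{0}$; and the final step is due to Equations 3.7 and 3.8.

Therefore, using Theorem 3.1, we get that if $\lambda_{\mathbf{S}} < 1$, then $\mathbf{0}$ is asymptotically
stable for $\mathbf{p}_{2t}$.

We now prove that $\mathbf{p}_{2t}$ in fact goes down exponentially to $\mathbf{0}$ if $\lambda_{\mathbf{S}} < 1$. To see this, after
linearizing both $g_1$ and $g_2$ at $\mathbf{p}_{2t} = \mathbf{p}_{2t+1} = \mathbf{0}$, we have

$$\mathbf{p}_{2t+2} \le \mathbf{S}_1 \mathbf{p}_{2t+1}, \qquad \mathbf{p}_{2t+1} \le \mathbf{S}_2 \mathbf{p}_{2t} \tag{3.12}$$

Doing the above recursively, we have

$$\mathbf{p}_{2t} \le (\mathbf{S}_1 \mathbf{S}_2)^t \mathbf{p}_0 = \mathbf{S}^t \mathbf{p}_0 \tag{3.13}$$

Let $\eta_t$ be the expected number of infected nodes at time $t$. Then,

$$\begin{aligned}
\eta_{2t} = |\mathbf{p}_{2t}|_1 &\le |\mathbf{S}^t \mathbf{p}_0|_1 \\
&\le |\mathbf{S}^t|_1\, |\mathbf{p}_0|_1 = |\mathbf{S}^t|_1\, \eta_0 \\
&\le \sqrt{n}\, |\mathbf{S}^t|_2\, \eta_0 = \sqrt{n}\, \lambda_{\mathbf{S}}^t\, \eta_0
\end{aligned} \tag{3.14}$$

Therefore, if $\lambda_{\mathbf{S}} < 1$, we have that $\eta_{2t}$ goes to zero exponentially fast. □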

The above lemma provides the condition for the even time-stamp probability vector
to go down exponentially. The next lemma shows that this condition is also enough to
ensure that the odd time-stamp probability vector goes down exponentially.

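Lemma 3.2. If $\lambda_{\mathbf{S}} < 1$, then $\mathbf{p}_{2t+1}$ dies out exponentially quickly; and $\mathbf{0}$ is asymptotically stable
for $\mathbf{p}_{2t+1}$, where $\mathbf{S}_1 = (1 - \delta)\mathbf{I} + \beta\mathbf{A}_1$, $\mathbf{S}_2 = (1 - \delta)\mathbf{I} + \beta\mathbf{A}_2$ and $\mathbf{S} = \mathbf{S}_1 \times \mathbf{S}_2$.

Proof. Doing the same analysis as in Lemma 3.1, we can see that the condition for $\mathbf{p}_{2t+1}$
to be asymptotically stable and die exponentially quickly is:

$$\lambda_{\mathbf{S}_2 \times \mathbf{S}_1} < 1 \tag{3.15}$$

Now note that as $\mathbf{S}_1$ and $\mathbf{S}_2$ are invertible: $\mathbf{S}_1 \times \mathbf{S}_2 = \mathbf{S}_1 \times (\mathbf{S}_2 \times \mathbf{S}_1) \times \mathbf{S}_1^{-1}$. But this implies
that $\mathbf{S}_2 \times \mathbf{S}_1$ is similar to $\mathbf{S}_1 \times \mathbf{S}_2$ (matrix $\mathbf{P}$ is similar to $\mathbf{Q}$ if $\mathbf{P} = \mathbf{B}\mathbf{Q}\mathbf{B}^{-1}$ for some invertible
$\mathbf{B}$). We know that similar matrices have the same spectrum [GVL89], thus $\mathbf{S}_2 \times \mathbf{S}_1$ and
$\mathbf{S}_1 \times \mathbf{S}_2$ have the same eigenvalues. Hence, the condition for exponential die-out of $\mathbf{p}_{2t+1}$
and asymptotic stability is the same as that for $\mathbf{p}_{2t}$, which is $\lambda_{\mathbf{S}} < 1$. □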

Lemma 3.1 and Lemma 3.2 imply that this threshold is well-defined in the sense that
the probability vectors for both the odd and even time-stamps go down exponentially.
Thus we can finally conclude the following theorem:

Theorem 3.1. (Epidemic Threshold) If $\lambda_{\mathbf{S}} < 1$, then $\mathbf{p}_{2t}$ and $\mathbf{p}_{2t+1}$ die out exponentially
quickly; and $\mathbf{0}$ is asymptotically stable for both $\mathbf{p}_{2t}$ and $\mathbf{p}_{2t+1}$, where $\mathbf{S}_1 = (1 - \delta)\mathbf{I} + \beta\mathbf{A}_1$,
$\mathbf{S}_2 = (1 - \delta)\mathbf{I} + \beta\mathbf{A}_2$ and $\mathbf{S} = \mathbf{S}_1 \times \mathbf{S}_2$. Similarly for any general T, the condition is:

$$\lambda_{\prod_{i=1}^{T} \mathbf{S}_i} < 1 \tag{3.16}$$

where $\forall i \in \{1, 2, \ldots, T\}$, $\mathbf{S}_i = (1 - \delta)\mathbf{I} + \beta\mathbf{A}_i$.

We call $\mathbf{S}$ the system-matrix of the system; thus, the first eigenvalue of the system-matrix
determines whether a given system is below threshold or not.
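
The threshold test can be sketched in a few lines of Python with NumPy. This is only an illustration under the assumption that the system-matrix for general T is the ordered product of the per-step matrices $\mathbf{S}_i$, as in the T = 2 case; the function names (`system_matrix`, `first_eigenvalue`, `below_threshold`) are hypothetical.

```python
from functools import reduce

import numpy as np

def system_matrix(graphs, beta, delta):
    """S = S_1 x S_2 x ... x S_T with S_i = (1 - delta) I + beta A_i."""
    n = graphs[0].shape[0]
    identity = np.eye(n)
    factors = [(1.0 - delta) * identity + beta * A for A in graphs]
    return reduce(np.matmul, factors)

def first_eigenvalue(M):
    """Largest eigenvalue of M in absolute value."""
    return float(np.max(np.abs(np.linalg.eigvals(M))))

def below_threshold(graphs, beta, delta):
    """True if the first eigenvalue of the system-matrix is below 1."""
    return first_eigenvalue(system_matrix(graphs, beta, delta)) < 1.0
```

For a single graph (`graphs = [A1]`) the test reduces to the static case discussed in the next section.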


3.5  Salient  Points 

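Sanity check: Clearly, when T = 1, the system is equivalent to a static graph system
with $\mathbf{A}_1$ and virus parameters $\beta$, $\delta$. In this case the threshold is (from Theorem 3.1)
$\lambda_{(1-\delta)\mathbf{I} + \beta\mathbf{A}_1} = (1 - \delta) + \beta\lambda_{\mathbf{A}_1} < 1 \Rightarrow \beta\lambda_{\mathbf{A}_1}/\delta < 1$, i.e. we recover the known threshold in the static
case [CWW+08].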

A conservative condition: Notice that from Equations 3.7 and 3.8 and Theorem 3.1, for
our NLDS to be fully asymptotically stable at $\mathbf{0}$ (i.e. $\mathbf{p}_t$ decays monotonically), we need
the eigenvalues of both $\mathbf{S}_1$ and $\mathbf{S}_2$ to be less than 1 in absolute value. Hence, $\beta\lambda/\delta < 1$,
where $\lambda = \max(\lambda_{\mathbf{A}_1}, \lambda_{\mathbf{A}_2})$, is sufficient for full stability. Intuitively, this argument says
that the alternating sequence of graphs cannot be worse than the static case of having
the best-connected graph of the two repeated indefinitely. Let $\lambda_{\mathbf{A}_1} > \lambda_{\mathbf{A}_2}$. Consider
a sequence of graphs $\mathcal{S} = \{\mathbf{A}_1, \mathbf{A}_1, \ldots\}$ repeating indefinitely instead of our alternating
$\{\mathbf{A}_1, \mathbf{A}_2, \mathbf{A}_1, \mathbf{A}_2, \ldots\}$ sequence. Clearly, if an infection dies exponentially in $\mathcal{S}$, then it will
die exponentially in our original alternating sequence as well, because $\lambda_{\mathbf{A}_1} > \lambda_{\mathbf{A}_2}$. The
set $\mathcal{S}$ is essentially just the static graph case: hence, if $\beta\lambda_{\mathbf{A}_1}/\delta = \beta\lambda/\delta < 1$, then $\mathbf{0}$ is
asymptotically stable for $\mathbf{p}_t$. The case when $\lambda_{\mathbf{A}_1} < \lambda_{\mathbf{A}_2}$ is similar. But this notion of a
threshold is too stringent and conservative: a stronger virus can still
lead to a general exponential decrease instead of a strict monotonic decrease. This is
because we forced the eigenvalues of both $\mathbf{S}_1$ and $\mathbf{S}_2$ to be less than 1 in absolute value
here, when we can get away with less. Theorem 3.1 precisely formalizes this
idea and gives us a more practical condition for a general decreasing trend, in which the
values at every corresponding alternating time-stamp decay. We illustrate this further in the
experiments.
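
To make the gap between the two conditions concrete, the sketch below compares them on a clique/chain pair of the kind used in the synthetic example of Section 3.6. It is an illustration only: the parameter values $\beta = 0.006$ and $\delta = 0.5$ are hypothetical, chosen to separate the two conditions, and the helper names are our own.

```python
import numpy as np

def clique(n):
    """Adjacency matrix of a full clique on n nodes (no self loops)."""
    return np.ones((n, n)) - np.eye(n)

def chain(n):
    """Adjacency matrix of a chain (path) on n nodes."""
    A = np.zeros((n, n))
    idx = np.arange(n - 1)
    A[idx, idx + 1] = 1.0
    A[idx + 1, idx] = 1.0
    return A

def spectral_radius(M):
    """Largest eigenvalue of M in absolute value."""
    return float(np.max(np.abs(np.linalg.eigvals(M))))

n, beta, delta = 100, 0.006, 0.5  # illustrative values, not from the experiments
A1, A2 = clique(n), chain(n)
S1 = (1.0 - delta) * np.eye(n) + beta * A1
S2 = (1.0 - delta) * np.eye(n) + beta * A2

# Conservative condition: beta * max(lambda_A1, lambda_A2) / delta < 1
conservative = beta * max(spectral_radius(A1), spectral_radius(A2)) / delta
# System-matrix condition: lambda_S < 1 with S = S1 x S2
lambda_S = spectral_radius(S1 @ S2)

print(f"conservative ratio = {conservative:.3f}")
print(f"lambda_S           = {lambda_S:.3f}")
```

Here the clique has first eigenvalue 99, so the conservative ratio is 0.006 x 99 / 0.5 = 1.188 > 1, while $\lambda_{\mathbf{S}}$ is bounded above by $\lambda_{\mathbf{S}_1}\lambda_{\mathbf{S}_2} \approx 1.094 \times 0.512 < 1$: the conservative test fails even though the system is below the threshold.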


3.6  Experiments 

We will demonstrate our result in this section using simulation experiments. We conducted
a series of experiments using the MIT Reality Mining data set [EPL09]. The
Reality Mining data consists of 104 mobile devices (cellular phones) monitored over a
period of nine months (September 2004 - June 2005). If another participating Bluetooth
device was within a range of approximately 5-10 meters, the date and time of the contact
and the device's MAC address were recorded. Bluetooth scans were conducted at
5-minute intervals and aggregated into two 12-hour period adjacency matrices (day and
night). The epidemic simulations were accomplished by alternating the day and night
matrices over the period of simulation. All experiments were run on a 64-bit, quad-core
(2.5 GHz each) server running a CentOS Linux distribution with 72 GB of shared RAM.
Simulations were conducted using a combination of Matlab 7.9 and Python 2.6.

[Figure 3.1 panel titles: (A) Infected Fraction Time Plot (lin-log); (B) Max. Infections till steady state vs $\lambda_{\mathbf{S}}$ (lin-log)]

Figure 3.1: SIS simulations on our synthetic example (all values averaged over 20 runs). (A)
Fraction of nodes infected vs time-stamp (lin-log scale). Note the qualitative difference in
behavior under (green) and above (red) the threshold. Also, note that the green line is below the
threshold but is actually above the conservative threshold ($\beta\lambda/\delta = 1.100$ here). Hence, while both
$\mathbf{p}_{2t}$ and $\mathbf{p}_{2t+1}$ decrease exponentially separately, $\mathbf{p}_t$ itself does not go down monotonically.
(B) Plot of max. number of infected nodes till steady state vs $\lambda_{\mathbf{S}}$ (by varying $\beta$) (lin-log). As
predicted by our results, notice that there is a sudden 'take-off' and a change of behavior of the
curve exactly when $\lambda_{\mathbf{S}} = 1$.

Figures 3.1 and 3.2 demonstrate our result on a synthetic example and on graphs from the
MIT Reality Mining data. In the synthetic example, we have 100 nodes, such that $G_1$ is a full
clique (without self loops) whereas $G_2$ is a chain. All values are averaged over several
runs of the simulations, and the infection is started by infecting 5 nodes. In short, as
expected from the theorem, the difference in behavior above, below and at threshold can
be distinctly seen in the figures.
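
A simulation of this kind can be sketched as follows. This is not the Matlab/Python code used for the experiments; it is a minimal discrete-time SIS simulator written to match the NLDS of Section 3.4, and the function name `simulate_sis` and its argument layout are assumptions made for illustration.

```python
import numpy as np

def simulate_sis(graphs, beta, delta, initial_infected, steps, rng):
    """Discrete-time SIS simulation; graphs[t % T] is active at time-stamp t.

    Returns the number of infected nodes at each time-stamp (length steps + 1).
    """
    n = graphs[0].shape[0]
    infected = np.zeros(n, dtype=bool)
    infected[list(initial_infected)] = True
    counts = [int(infected.sum())]
    for t in range(steps):
        A = graphs[t % len(graphs)]
        # Number of infected neighbors of each node in the active graph.
        k = A @ infected.astype(float)
        # A susceptible node escapes all k infected neighbors w.p. (1 - beta)^k.
        p_infect = 1.0 - (1.0 - beta) ** k
        newly_infected = ~infected & (rng.random(n) < p_infect)
        # An infected node is cured with probability delta.
        stays_infected = infected & (rng.random(n) >= delta)
        infected = newly_infected | stays_infected
        counts.append(int(infected.sum()))
    return counts
```

A run in the spirit of the synthetic example would pass a clique and a chain on 100 nodes as `graphs`, five initially infected nodes, and a generator such as `np.random.default_rng(0)`, and then average the returned counts over several runs.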

Figures 3.1(A) and 3.2(A) show the time-plot of the number of infections for $\lambda_{\mathbf{S}}$ values
above and below the threshold. While above threshold the infection reaches a steady
state way above the starting point, below threshold it decays fast and dies out. In the case of
Figure 3.1(A), also note the difference between the conservative threshold and our
threshold. The green curve is below our threshold but above the conservative threshold.
As predicted by our theorems, there are damped oscillations and the infection decays,
but $\mathbf{p}_t$ itself does not go down monotonically (hence the
"bumpy" nature of the curve). This exemplifies the practical nature of our threshold
and its usefulness, as we are more concerned with the general trend of the number
of infections curve than with every small 'bump' caused by the presence of alternating
graphs.

Figures 3.1(B) and 3.2(B) show a 'take-off' plot of the max. number of infections
till steady state (intuitively the 'footprint') for different values of $\lambda_{\mathbf{S}}$ (by varying $\beta$). As
predicted by our theorem, note the sudden steep change and spike in the size of the
footprint when $\lambda_{\mathbf{S}} = 1$ in both the plots.

[Figure 3.2 panel titles: (A) Infected Fraction Time Plot (lin-log); (B) Max. Infections till steady state vs $\lambda_{\mathbf{S}}$ (lin-log)]

Figure 3.2: SIS simulations on the MIT Reality Mining graphs (all values averaged over 20 runs).
(A) Fraction of nodes infected vs time-stamp (lin-log scale). Note the qualitative difference in
behavior under (green) and above (red) the threshold. (B) Plot of max. number of infected nodes
till steady state vs $\lambda_{\mathbf{S}}$ (by varying $\beta$) (lin-log). As predicted by our results, notice that there is a
sudden 'take-off' and a change of behavior of the curve when $\lambda_{\mathbf{S}} = 1$.

3.7  Discussion — Generality  of  our  results 

How can we model more complex situations like 'unequal duration' behaviors etc.? Note
that the alternation period T can be longer than 2 and we can have repetitions in the set $\mathcal{A}$
as well, e.g., to represent a weekly-style (work day-weekend) alternation we can have
T = 7 and $\mathcal{A} = \{\mathbf{A}_1, \mathbf{A}_1, \ldots, \mathbf{A}_1, \mathbf{A}_2, \mathbf{A}_2\}$. Similarly, we can model situations like unequal
duration of 'day' and 'night', i.e. unequal duration of matrices $\mathbf{A}_1$ and $\mathbf{A}_2$. Say, $\mathbf{A}_1$ is
present for only 8 hours at work, while $\mathbf{A}_2$ is present for the remaining 16 hours at home.
Then, thinking of an hour as our time-step, i.e. T = 24, we have the set $\mathcal{A} = \{\mathbf{A}_1, \ldots, \mathbf{A}_1, \mathbf{A}_2, \mathbf{A}_2, \ldots\}$,
where $\mathbf{A}_1$ occurs 8 times in $\mathcal{A}$ while $\mathbf{A}_2$ occurs 16 times. All the threshold results carry
forward seamlessly in all the above cases.
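
As a worked instance, assuming the general-T condition takes the same product form as in the T = 2 case (and that $\beta$ and $\delta$ are then interpreted per time-step, which is our assumption here), the two examples above would give:

```latex
% Weekly alternation: five work days (A_1) followed by two weekend days (A_2), T = 7
\mathbf{S}_{\text{week}} = \mathbf{S}_1^{5}\,\mathbf{S}_2^{2},
\qquad \lambda_{\mathbf{S}_{\text{week}}} < 1

% Hourly time-steps: 8 hours of A_1 followed by 16 hours of A_2, T = 24
\mathbf{S}_{\text{hour}} = \mathbf{S}_1^{8}\,\mathbf{S}_2^{16},
\qquad \lambda_{\mathbf{S}_{\text{hour}}} < 1

% with S_i = (1 - \delta) I + \beta A_i as before
```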


3.8  Conclusion 

In this chapter, we analytically studied virus-spreading (specifically the SIS model) on
arbitrary, time-varying graphs. Given a set of T alternating graphs, modeling e.g. the
day/night pattern of human behavior, we ask: what is the epidemic threshold? Our
main contributions are:

1.  We  show  how  to  formulate  the  problem,  namely  by  approximating  it  with  a  Non- 
Linear  Dynamical  System  (NLDS). 

2.  We  give  the  first  closed-formula  for  the  threshold,  involving  the  first  eigenvalue  of 
the system-matrix (see Theorem 3.1).

Chapter 4

Competing  Viruses:  Winner  Takes  All 


So far, we have assumed that a single virus is spreading in isolation, without any
competition. In this chapter, we instead look into understanding the spread of multiple
competing viruses. Given two competing products (or memes, or viruses etc.) spreading
over a given network, can we predict what will happen at the end, that is, which product
will 'win', in terms of highest market share? One may naively expect that the better
product (stronger virus) will just have a larger footprint, proportional to the quality ratio
of the products (or strength ratio of the viruses). However, we prove the surprising result
that, under realistic conditions, for any graph topology, the stronger virus completely
wipes out the weaker one, thus not merely 'winning' but 'taking it all'. In addition to
the proofs, we also demonstrate our result with simulations over diverse, real graph
topologies, including the social-contact graph of the city of Portland, OR (about 31 million
edges and 1 million nodes) and internet AS router graphs. Finally, we also provide real
data about competing products from Google-Insights, like Facebook-Myspace, and we
show again that they agree with our analysis.


4.1  Introduction 

Given two competing products like iPhone/Android or Blu-ray/HD-DVD, and 'word of
mouth' adoption of them, what will happen in the end? This question is of interest in
numerous settings. For example, in a biological virus setting, we have the common flu
versus avian flu. In a computer virus setting, clever virus authors make sure that their
code eliminates most other computer viruses from the victim's disk. The list continues,
with competing scientific theories, competing memes ('coke' vs 'soda' vs 'pop'), and
many more.

Our main result is that we answer the above question analytically, and we show that
'winner takes all' (WTA), or, more accurately, the weaker product/virus will soon become
extinct. The fate of the stronger virus depends on its strength: below the epidemic
threshold (more details later), it will also become extinct, but above that it has good
chances of lingering practically forever. In more detail, we assume

(a) an SIS-like model (no immunity, like flu),

(b) perfect mutual immunity (a node can have at most one of the viruses/products at
any given time),

(c) the underlying network is connected (every node can reach every other node), and

(d) the network is 'fair-play', in the sense that all nodes have the same behavior towards
the two competing products/viruses: everybody has the same probability $\beta_1$ of
getting infected with virus-1 by a sick neighbor, and similarly for virus-2, and for
the recovery times.

Figure 4.1: (a) Number of infected nodes vs time for simulations on the AS-OREGON network for a virus
propagating in isolation (brown square; note that it is above the epidemic threshold and hence
doesn't die out), and for the same virus (red) competing with an even stronger virus (green); note that
the weaker virus now dies out completely. (b) Winner takes all in search interest data from Google-Insights
for Facebook (green) and Myspace (red). Even though Myspace got a head-start, Facebook wiped
it out.

One  of  the  main  contributions  is  that  our  theoretical  analysis  holds  for  any  graph 
topology,  while  earlier  work  focuses  only  on  specific-topology  graphs  (cliques,  random, 
etc). 

Figure 4.1(a) gives an illustration of our result: it shows the number of infected nodes
vs time for computer simulations on the AS-OREGON network (see Section 4.5 for details)
for an 'above-threshold' virus propagating in isolation (brown square) in one case, and
the same virus competing with an even stronger virus (green) in another case. Clearly, it is
wiped out during the competition, although it put up a fight (red, note the bump). Note
that though both the viruses are above the threshold, the weaker virus is wiped out. We
prove this result for arbitrary underlying networks in this chapter.

Figure  4.1(b)  shows  the  time  evolution  of  search-interest  for  a  pair  of  competing 
products  Facebook-Myspace.  The  data  came  from  Google-Insights.  Notice  that  again, 
the  weaker  competitor  is  extinct  (or  close  to  that).  We  will  give  more  case-studies  later  in 
Section  4.5. 

The  outline  of  the  chapter  is  as  follows:  we  review  related  work  in  §  4.2  and  formulate 
the  problem  giving  details  of  our  model  in  §  4.3.  We  give  the  analysis  and  proof  of  our 
WTA  result  in  §  4.4  while  we  demonstrate  it  using  simulations  and  real  case-studies  in 
§ 4.5. Finally, we discuss some subtle issues in § 4.6 and conclude in § 4.7.

4.2 Related Work


We present the related work in this section, which can be categorized into three parts:
epidemic thresholds, information diffusion and ecology. We have already summarized
prior work on epidemic thresholds for single viruses. Here we give the rest of the
related work. Most of these works either consider only single-virus models, or typically
use only simulation, or restrict their analysis to very specific underlying networks.

Information Diffusion: Broadly, two classes of information cascade models have been
proposed: (a) independent cascade (IC) [KKT03] (essentially an 'SIR' model) and (b) linear
threshold (LT) [Gra78]. Research work on multiple cascades has looked into extensions
of the IC model with the restriction that nodes can't switch from one competitor to
the other [BKS07, KOW08]. One of the few works to consider switching between the
competitors is Pathak et al. [PBS10]. However, their work differs from ours in several
important aspects, as they: (a) use the LT model, as opposed to the 'flu-like' SIS model
(a cascade style model) we use; (b) assume that nodes may randomly switch between
products; (c) do not find WTA phenomena; and (d) give no closed-form results, only an
algorithm to compute the steady state.

Ecology: In ecology, the principle of 'competitive exclusion' espouses that two species
cannot occupy the same ecological niche in the long term. Research has gone into studying
this using various propagation models like SIS, SIR, Lotka-Volterra [CCHL96, CCHL99,
AA05, AM82]. These works typically used simulations, or only studied homogeneous or structured
topologies like cliques.

Distinguishing features of current work: In short, none of the previous work fulfills
all the conditions of this current work: (a) analytical proof of 'WTA' (b) in arbitrary
topologies (c) under an SIS-like model.


4.3  Problem  Formulation 

In  this  section,  we  formulate  our  problem,  giving  details  about  the  model  used  and  the 
assumptions.  Table  4.1  explains  the  terminology  we  have  used  in  the  chapter.  Bold  letters 
typically  denote  matrices  (A,  C  etc.)  or  vectors  (P,u  etc.). 

4.3.1  The  propagation  model 

We assume that the competing viruses are spreading on the network according to a
propagation model, which we describe next. We call our propagation model $SI_1I_2S$,
based on the popular "flu-like" SIS (Susceptible-Infected-Susceptible) model [Het00].
$SI_1I_2S$ denotes Susceptible - Infected$_1$ - Infected$_2$ - Susceptible. Each node in the graph
can be in one of three states: Susceptible (healthy), $I_1$ (infected by virus 1), or $I_2$ (infected
by virus 2). The state transition diagram as seen from a node in the network is shown in
Figure 4.2. We could have extended other single virus models as well, but we believe
that our model is a reasonable starting point, and we leave the analysis of other models
as future work.

Table 4.1: Terms and Symbols

Symbol                          Definition and Description
WTA                             Winner-Takes-All
$SI_1I_2S$                      our competing viruses model
$\beta_1$ (or $\beta_2$)        attack rate of virus 1 (or virus 2)
$\delta_1$ (or $\delta_2$)      cure rate of virus 1 (or virus 2)
$\mathbf{A}$                    adjacency matrix of the underlying graph
$\lambda_{\mathbf{M}}$          set of eigenvalues of the matrix $\mathbf{M}$
$\lambda_1(\mathbf{M})$         largest eigenvalue of matrix $\mathbf{M}$
$\lambda$                       $\lambda_1(\mathbf{A})$
$\sigma_1$                      $\lambda\beta_1/\delta_1$ (strength of virus 1)
$\sigma_2$                      $\lambda\beta_2/\delta_2$ (strength of virus 2)
$\mathbf{M}^T$                  transpose of $\mathbf{M}$
$NE(i)$                         set of neighbors of node $i$ in the graph
$\mathbf{I}$                    identity matrix of appropriate size
$\mathbf{0}$                    all-zeros matrix of appropriate size
diag($\mathbf{P}$)              the diagonal matrix with elements of vector $\mathbf{P}$ in the diagonal

Healing (virus death) rate: $\delta$. If a node is in state $I_1$ (or $I_2$), it recovers on its own
with rate $\delta_1$ (or $\delta_2$). This implies that the time taken for each infected node to heal
is exponentially distributed with parameter $\delta_1$ (or $\delta_2$). This parameter captures the
persistence of the virus in an inverse way: a high $\delta$ means low persistence. For example,
a very convincing rumor that sticks to one's mind will be modeled with a low $\delta$ value.

Figure 4.2: State diagram for a node in the graph under our $SI_1I_2S$ competing viruses model.
The node is in the S state if it doesn't have either competitor (say iPhone or Android). It is in $I_1$ if
it gets the iPhone (virus 1) and is in $I_2$ if it gets the Android (virus 2). The transitions from S to $I_1$
or $I_2$ (red-curvy arrows) depend on the infected neighbors of the node. The remaining transitions,
in contrast, are self-transitions, without the aid of any neighbor.

Attack (virus transmission) rate: $\beta$. A healthy node gets infected by infected neighbors, and the virus transmissibility is captured by $\beta_1$ and $\beta_2$. Specifically, an infected node transmits its infection to each of its neighbors independently at rate $\beta_1$ (or $\beta_2$). Hence, the time taken for each infected node to transmit the virus to a neighbor over a link is exponentially distributed with parameter $\beta_1$ (or $\beta_2$).

This  is  a  novel  generalization  of  the  single-virus  SIS  model  to  a  competing-viruses 
scenario.  Note  the  competition  between  the  viruses:  each  virus  has  to  compete  with  the 
other  for  healthy  victims.  Moreover,  note  that  we  assume  full  mutual  immunity:  while  a 
node  is  infected  by  one  virus,  it  cannot  be  infected  by  the  other. 

Fair-play: We assume that the competitors are playing a 'fair game': all nodes in the network have the same model parameters ($\beta$'s and $\delta$'s) for each of the viruses and behave according to the state diagram in Figure 4.2.
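The transition rules described above can be summarized by the rates seen from a single node. The following is a minimal Python sketch, not code from this thesis: the function name `transition_rates`, the string labels `"S"`, `"I1"`, `"I2"`, and the dict-based graph representation are all assumptions made for illustration.

```python
from typing import Dict, List


def transition_rates(
    node: int,
    state: Dict[int, str],
    neighbors: Dict[int, List[int]],
    beta1: float,
    delta1: float,
    beta2: float,
    delta2: float,
) -> Dict[str, float]:
    """Rates of the transitions available to `node` under the SI1I2S model.

    Keys are target states, values are the rates of the exponential clocks.
    """
    current = state[node]
    if current == "S":
        # Each infected neighbor attacks independently, so the rates add up.
        n1 = sum(1 for j in neighbors[node] if state[j] == "I1")
        n2 = sum(1 for j in neighbors[node] if state[j] == "I2")
        return {"I1": beta1 * n1, "I2": beta2 * n2}
    # Full mutual immunity: an infected node can only heal back to S.
    return {"S": delta1 if current == "I1" else delta2}


# Hypothetical example: node 0 is healthy and has one neighbor per virus.
example_state = {0: "S", 1: "I1", 2: "I2"}
example_neighbors = {0: [1, 2], 1: [0], 2: [0]}
rates = transition_rates(0, example_state, example_neighbors, 0.5, 1.0, 0.25, 2.0)
```

In this toy configuration the healthy node is attacked by virus 1 at rate $\beta_1$ and by virus 2 at rate $\beta_2$, while each infected node only has its own healing transition.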

4.3.2  Problem  Statement 

We  are  now  in  a  position  to  state  the  problem  formally.  We  assume  the  underlying 
network  is  connected  -  otherwise  we  just  have  separate  disconnected  problems. 

Competing  viruses  problem 

Given: An undirected connected graph $G$, and the propagation model ($SI_1I_2S$) parameters ($\beta_1$, $\delta_1$ for virus 1; $\beta_2$, $\delta_2$ for virus 2).

Find: The outcome in the long run, i.e. the steady-state populations of the two viruses.


4.4  WTA:  Results  and  Proofs 

We  prove  our  winner-takes-all  (WTA)  result  on  an  arbitrary  undirected  graph  in  this 
section.  Our  main  result  can  be  formally  stated  as  follows: 

Theorem 4.1 (Winner takes all). Given an arbitrary undirected, connected graph with adjacency matrix $A$ and the $SI_1I_2S$ model parameters ($\beta_1$, $\beta_2$, $\delta_1$, $\delta_2$), virus 1 will dominate and virus 2 will completely die out in the steady state if virus 1 is above threshold$^1$ and the strength of virus 1 is greater than the strength of virus 2, i.e. if $\sigma_1 > 1$ and $\sigma_1 > \sigma_2$.

The  proof  is  involved,  and  we  present  it  in  the  next  few  pages.  We  will  first  prove  it 
for  simpler  cases  of  the  underlying  network  -  namely  a  clique  and  a  barbell  before  we 
move  on  to  arbitrary  graphs. 

$^1$ As follows from Chapter 2, for the single-virus SIS model, a virus dies out unless it is above the epidemic threshold, i.e. unless $\beta\lambda/\delta > 1$, where $\lambda$ is the largest eigenvalue of the adjacency matrix of the underlying graph.
4.4.1 Proof roadmap

In  short,  the  proof  has  the  following  steps: 

1.  Dynamical  System:  construct  a  suitable  dynamical  system  of  differential  equations 
for  the  propagation  process, 

2.  Fixed  Points:  prove  that  there  are  only  three  fixed  points  and  at  least  one  of  the 
viruses  has  to  die  out  at  any  fixed  point,  and 

3.  Stability  Conditions:  give  the  precise  conditions  for  each  fixed  point  to  be  stable 
(attracting). 

Intuitively, the dynamical system generates a field on which we show that only three fixed points can exist. Moreover, the field makes only one of the possible fixed points stable under any given scenario. Figures 4.3(a-c) show the field plots in a simple case, when the underlying graph is a clique$^2$ of size $N = 1000$. Specifically, we show three scenarios (wlog, we assume the first virus is the stronger virus):

BELOW : $1 > \beta_1 N/\delta_1 = 0.6 > \beta_2 N/\delta_2 = 0.2$
(both viruses below the threshold)

MIXED : $\beta_1 N/\delta_1 = 6 > 1 > \beta_2 N/\delta_2 = 0.2$
(one above and one below the threshold)

ABOVE : $\beta_1 N/\delta_1 = 6 > \beta_2 N/\delta_2 = 4 > 1$
(both above the threshold)

The field plots illustrate the fixed points in this setting and their stability. In this case, we have a 2-dimensional field, but for an arbitrary graph the dimension will depend on the number of nodes in the graph. At any point on the field, the direction of the field-arrow tells us where the system will go next. Stable fixed points are marked by bold circles and unstable fixed points by hollow circles; the x-axis denotes the number of nodes infected by virus 2 and the y-axis denotes the number infected by virus 1 (the stronger virus). For example, in Figure 4.3(c), both viruses are above threshold, yet the FP1 and FP3 points are unstable while the other fixed point, corresponding to the stronger virus winning (FP2), is stable. The trajectory of the simulation is overlaid on the field plots: we can see that the system follows the field lines, is attracted towards the stable fixed point, and ends up there in the steady state. We also show the time-evolution separately in Figures 4.3(d-f). Note especially part (f) (ABOVE): virus 2 tries to take over, but is overpowered by virus 1, which goes on to dominate. We can similarly observe the BELOW and MIXED scenarios as well.

We elaborate a bit more on the steps next. Consider a dynamical system (set of differential equations) of the form $x' = F(x)$, where $x'$ is the (component-wise) time derivative of $x$, and $F : \mathbb{R}^n \to \mathbb{R}^n$ is continuous and differentiable. If $F(x_0) = 0$, then $x_0$ is a stationary point (also called a fixed point). The proof begins by setting up the propagation as a dynamical system of non-linear differential equations and then analyzes the possible fixed points and their stability conditions. In principle, one might expect that there might be several fixed points of the system corresponding to different proportions of the populations of the two viruses. But we prove that in fact there are always only three fixed points possible, and in each at least one virus gets wiped out.

$^2$ Every node is connected to every other node.

Figure 4.3: (a-c) Field plots for a clique of size 1000 for various cases. Stable fixed points are marked by bold circles, unstable fixed points by hollow circles, while the y-axis denotes the number of nodes infected by virus 1 and the x-axis denotes the number infected by virus 2. The simulation trajectory is overlaid. The field plots illustrate the fixed points in this setting and their stability. (d-f) Corresponding time-evolution plots of the competition. Note that virus 2 (red curve) in (f) could have won in isolation, but lost to virus 1.

Further,  intuitively,  if  a  fixed  point  is  not  stable  then  the  system  would  be  repelled 
whenever  it  tries  to  approach  that  fixed  point.  Hence,  to  fully  characterize  the  fixed 
points,  we  need  to  derive  the  stability  conditions,  which  give  us  the  conditions  for  each 
of  these  fixed  points  to  be  stable  and  attracting. 

For characterizing the stability of the fixed points, we use a well-known result from dynamical system theory (cf. [HS74]). A fixed point is hyperbolic (i.e. the linearized stability analysis can be performed there) only when none of the eigenvalues of the corresponding Jacobian$^3$ has a zero real part. Further, the system will be stable at a hyperbolic fixed point (attractor) only if the real parts of the eigenvalues of the Jacobian are negative. So to make a fixed point a hyperbolic stable attractor we need to impose the condition that the real parts of all the eigenvalues of the corresponding Jacobian should be negative. We next show this proof scheme for the special case of a clique topology.

$^3$ The Jacobian is the matrix of all component-wise first-order partial derivatives of $x'$ with respect to $x$, evaluated at the fixed point.


4.4.2  Special  case:  Clique  Topology 

In a clique, all nodes are connected to each other with undirected and unweighted edges. Each node is identical to the others and hence our system is a simple continuous-time Markov chain, due to which we can write down the system equations directly. Let $N$ be the size of the clique and $I_1$ be the number of nodes infected by virus 1 at some time $t$. Similarly define $I_2$.

Dynamical System: Clearly, under our $SI_1I_2S$ model, we have the following system equations:

$$\frac{dI_1}{dt} = \beta_1 (N - I_1 - I_2) I_1 - \delta_1 I_1$$

$$\frac{dI_2}{dt} = \beta_2 (N - I_1 - I_2) I_2 - \delta_2 I_2$$

Fixed Points: There are three fixed points of the system of differential equations above (when the rates of change in $I_1$ and $I_2$ are zero):

1. $\{I_1 \to 0,\ I_2 \to 0\}$ (i.e. the viruses die out)

2. $\{I_1 \to N - \delta_1/\beta_1,\ I_2 \to 0\}$ (i.e. only virus 1 survives)

3. $\{I_2 \to N - \delta_2/\beta_2,\ I_1 \to 0\}$ (i.e. only virus 2 survives)

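Stability Conditions: The corresponding Jacobians at the fixed points are:

1. $J_1 = \begin{bmatrix} N\beta_1 - \delta_1 & 0 \\ 0 & N\beta_2 - \delta_2 \end{bmatrix}$

2. $J_2 = \begin{bmatrix} -\delta_1 + \beta_1\left(N - 2\left(N - \frac{\delta_1}{\beta_1}\right)\right) & -\beta_1\left(N - \frac{\delta_1}{\beta_1}\right) \\ 0 & -\delta_2 + \frac{\beta_2\delta_1}{\beta_1} \end{bmatrix}$

3. $J_3 = \begin{bmatrix} -\delta_1 + \frac{\beta_1\delta_2}{\beta_2} & 0 \\ -\beta_2\left(N - \frac{\delta_2}{\beta_2}\right) & -\delta_2 + \beta_2\left(N - 2\left(N - \frac{\delta_2}{\beta_2}\right)\right) \end{bmatrix}$

The eigenvalues of the Jacobians can be seen to be:

1. $\lambda_{J_1} = \left\{\beta_1\left(N - \frac{\delta_1}{\beta_1}\right),\ \beta_2\left(N - \frac{\delta_2}{\beta_2}\right)\right\}$

2. $\lambda_{J_2} = \left\{\beta_1\left(\frac{\delta_1}{\beta_1} - N\right),\ \beta_2\left(\frac{\delta_1}{\beta_1} - \frac{\delta_2}{\beta_2}\right)\right\}$

3. $\lambda_{J_3} = \left\{\beta_2\left(\frac{\delta_2}{\beta_2} - N\right),\ \beta_1\left(\frac{\delta_2}{\beta_2} - \frac{\delta_1}{\beta_1}\right)\right\}$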

From our preceding discussion we know that to have stable fixed points we require that the real parts of the eigenvalues of the Jacobians be negative. Clearly the corresponding conditions for each fixed point to be (a) hyperbolic and (b) a stable attractor are:

1. $\frac{\beta_1 N}{\delta_1} < 1$ and $\frac{\beta_2 N}{\delta_2} < 1$
(i.e. both are below threshold)

2. $\frac{\beta_1 N}{\delta_1} > 1$ and $\frac{\beta_1 N}{\delta_1} > \frac{\beta_2 N}{\delta_2}$
(i.e. virus 1 is above threshold and virus 1 strength is greater than virus 2)

3. $\frac{\beta_2 N}{\delta_2} > 1$ and $\frac{\beta_2 N}{\delta_2} > \frac{\beta_1 N}{\delta_1}$
(i.e. virus 2 is above threshold and virus 2 strength is greater than virus 1)


Firstly, note that we recover a result similar to the single-virus SIS model case: if both viruses are below the epidemic threshold, they both die out. Secondly, we can conclude that in the case of a clique, the stronger virus wipes out the weaker virus if it is above the epidemic threshold.
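The clique analysis above can be checked numerically for the three scenarios of Section 4.4.1. The sketch below is illustrative only and is not the code used in this thesis; the function names are hypothetical, $\delta_1 = \delta_2 = 1$ is an assumed choice, and a fixed point with a negative population is treated as infeasible.

```python
import cmath
from typing import Dict, List, Tuple


def clique_fixed_points(n, beta1, delta1, beta2, delta2) -> Dict[str, Tuple[float, float]]:
    """The three fixed points (I1, I2) of the clique system."""
    return {
        "FP1": (0.0, 0.0),
        "FP2": (n - delta1 / beta1, 0.0),
        "FP3": (0.0, n - delta2 / beta2),
    }


def jacobian_eigenvalues(n, beta1, delta1, beta2, delta2, i1, i2):
    """Eigenvalues of the 2x2 Jacobian of (dI1/dt, dI2/dt) at (i1, i2)."""
    a = beta1 * (n - 2 * i1 - i2) - delta1  # d(dI1/dt)/dI1
    b = -beta1 * i1                         # d(dI1/dt)/dI2
    c = -beta2 * i2                         # d(dI2/dt)/dI1
    d = beta2 * (n - i1 - 2 * i2) - delta2  # d(dI2/dt)/dI2
    trace, det = a + d, a * d - b * c
    root = cmath.sqrt(trace * trace - 4 * det)
    return (trace + root) / 2, (trace - root) / 2


def stable_fixed_points(n, beta1, delta1, beta2, delta2) -> List[str]:
    """Names of the feasible fixed points whose eigenvalues all have negative real part."""
    stable = []
    points = clique_fixed_points(n, beta1, delta1, beta2, delta2)
    for name, (i1, i2) in points.items():
        feasible = i1 >= 0 and i2 >= 0
        eigs = jacobian_eigenvalues(n, beta1, delta1, beta2, delta2, i1, i2)
        if feasible and all(e.real < 0 for e in eigs):
            stable.append(name)
    return stable


N = 1000
# (beta1, delta1, beta2, delta2), chosen so that beta*N/delta matches each scenario.
scenarios = {
    "BELOW": (0.6 / N, 1.0, 0.2 / N, 1.0),
    "MIXED": (6.0 / N, 1.0, 0.2 / N, 1.0),
    "ABOVE": (6.0 / N, 1.0, 4.0 / N, 1.0),
}
results = {name: stable_fixed_points(N, *params) for name, params in scenarios.items()}
```

Under these assumed parameters, exactly one fixed point is stable in each scenario: FP1 for BELOW, and FP2 (only the stronger virus survives) for MIXED and ABOVE.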


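Non-hyperbolic fixed points: We can see that the fixed points will be non-hyperbolic if the virus strengths are equal. Hence, in this case, no conclusions can be drawn from the linearized analysis and we take a different route. Note that we always have:

$$\int_{I_1^0}^{I_1} \frac{dI_1}{\beta_1 I_1} + \int_0^t \frac{\delta_1}{\beta_1}\,dt = \int_{I_2^0}^{I_2} \frac{dI_2}{\beta_2 I_2} + \int_0^t \frac{\delta_2}{\beta_2}\,dt \qquad (4.1)$$

$$\frac{\left(I_1/I_1^0\right)^{1/\beta_1}}{\left(I_2/I_2^0\right)^{1/\beta_2}} = \exp\left(\left(\frac{\delta_2}{\beta_2} - \frac{\delta_1}{\beta_1}\right) t\right) \qquad (4.2)$$

where $I_1^0$ and $I_2^0$ are the initial values of $I_1$ and $I_2$. (Both sides of Equation 4.1 equal $\int_0^t (N - I_1 - I_2)\,dt$, as can be seen by dividing the $k$-th system equation by $\beta_k I_k$.) The R.H.S. of Equation 4.2 will evaluate to one in our case (the virus strengths are equal). Hence now, the ratio of virus populations at any given time $t$ will be directly proportional to the initial ratio (up to some exponents). Also, clearly, the maximal ratios are attained at one of the last two fixed points.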


4.4.3  Special  Case:  Barbell  Graph 

A barbell graph $G$ has two cliques (say cliques $C_1$ and $C_2$ of size $N$ each) connected through weak edges. Specifically, we assume that all nodes in $C_1$ are connected to all nodes in $C_2$ (and vice versa) with edges of weight $\epsilon$, whereas they are connected with nodes within the same clique with edges of weight 1. In this case, by symmetry we can see that the virus populations in both the cliques should remain the same at the steady state. If we follow the steps in the case of a single clique, we get at steady state:

$$\beta_1 (N - I_1 - I_2) I_1 (1 + \epsilon) = \delta_1 I_1$$

$$\beta_2 (N - I_1 - I_2) I_2 (1 + \epsilon) = \delta_2 I_2$$

Hence, the only possible fixed points are:

1. $\{I_1 \to 0,\ I_2 \to 0\}$ (i.e. the viruses die out)

2. $\{I_1 \to N - \frac{\delta_1}{\beta_1 (1 + \epsilon)},\ I_2 \to 0\}$ (i.e. only virus 1 survives)

3. $\{I_2 \to N - \frac{\delta_2}{\beta_2 (1 + \epsilon)},\ I_1 \to 0\}$ (i.e. only virus 2 survives)
Moreover, continuing similarly as in the single clique case, we can see that the stronger virus again wipes out the weaker virus as long as it is above the epidemic threshold (note that in this case $\lambda = (1 + \epsilon)N$, hence the threshold condition for a single virus is $(1 + \epsilon)\beta N/\delta > 1$).
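The value of $\lambda$ for the barbell graph can be checked numerically. The following sketch is an illustration under stated assumptions, not code from this thesis: it builds the weighted adjacency matrix without self-loops and estimates the largest eigenvalue by plain power iteration; the helper names, $N = 50$ and $\epsilon = 0.01$ are hypothetical choices. Without self-loops the exact value is $(N - 1) + \epsilon N$, which approaches $(1 + \epsilon)N$ as $N$ grows.

```python
from typing import List


def barbell_adjacency(n: int, eps: float) -> List[List[float]]:
    """Weighted adjacency matrix of two n-cliques joined by edges of weight eps."""
    size = 2 * n
    adj = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            if i != j:
                same_clique = (i < n) == (j < n)
                adj[i][j] = 1.0 if same_clique else eps
    return adj


def largest_eigenvalue(adj: List[List[float]], iterations: int = 100) -> float:
    """Estimate the largest eigenvalue of a non-negative matrix by power iteration."""
    size = len(adj)
    vec = [1.0] * size
    estimate = 0.0
    for _ in range(iterations):
        nxt = [sum(adj[i][j] * vec[j] for j in range(size)) for i in range(size)]
        estimate = max(nxt)  # vec has max-norm 1, so this tracks the eigenvalue
        vec = [x / estimate for x in nxt]
    return estimate


n, eps = 50, 0.01
lam = largest_eigenvalue(barbell_adjacency(n, eps))
exact_without_self_loops = (n - 1) + eps * n  # about 49.5
approximation_from_text = (1 + eps) * n       # about 50.5
```

For this small example the two values differ by about 2%, and the gap shrinks as the cliques get larger.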

4.4.4  General  Arbitrary  Graph 

Let $A$ be the adjacency matrix of the arbitrary graph of $N$ nodes. Let $p_{i,1}$ be the probability of node $i$ being in the $I_1$ state. Similarly define $p_{i,2}$, and let $s_i$ be the probability of node $i$ being in the susceptible state. Clearly, $s_i + p_{i,1} + p_{i,2} = 1$.

Dynamical System: As we have a continuous-time process, the following system equations hold for each node $i$:

$$\frac{dp_{i,1}}{dt} = -\delta_1 p_{i,1} + \beta_1 (1 - p_{i,1} - p_{i,2}) \sum_j A_{ij} 1_{j,1}$$

$$\frac{dp_{i,2}}{dt} = -\delta_2 p_{i,2} + \beta_2 (1 - p_{i,1} - p_{i,2}) \sum_j A_{ij} 1_{j,2}$$

where $1_{j,k}$ (for $k = 1, 2$) is the indicator random variable denoting if node $j$ is infected with virus $k$. Our system is not a Markov chain due to the presence of the random variables $1_{j,k}$ in the rate equations. But after making a mean-field approximation ($1_{j,1} \approx E[1_{j,1}] = p_{j,1}$ and $1_{j,2} \approx p_{j,2}$, where $E[X]$ is the expected value of the random variable $X$), we get the following dynamical system:

$$\frac{dp_{i,1}}{dt} = -\delta_1 p_{i,1} + \beta_1 (1 - p_{i,1} - p_{i,2}) \sum_j A_{ij} p_{j,1} \qquad (4.3)$$

$$\frac{dp_{i,2}}{dt} = -\delta_2 p_{i,2} + \beta_2 (1 - p_{i,1} - p_{i,2}) \sum_j A_{ij} p_{j,2} \qquad (4.4)$$

(for each node $i$).
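To make the mean-field system concrete, the sketch below integrates it with forward Euler steps on a small unweighted graph. This is an illustration only, not the method used in this thesis: the integrator, the step size, the helper names and the test graph (a cycle of 20 nodes, whose largest eigenvalue is 2) are all assumptions. The parameters are chosen so that $\sigma_1 = 6$ and $\sigma_2 = 4$, i.e. the ABOVE scenario, and virus 2 is given a head start.

```python
from typing import List, Tuple


def mean_field_step(neighbors, p1, p2, beta1, delta1, beta2, delta2, dt):
    """One forward-Euler step of the mean-field system on an unweighted graph."""
    new_p1, new_p2 = [], []
    for i, nbrs in enumerate(neighbors):
        s_i = 1.0 - p1[i] - p2[i]             # probability that node i is susceptible
        pressure1 = sum(p1[j] for j in nbrs)  # sum_j A_ij p_{j,1}
        pressure2 = sum(p2[j] for j in nbrs)  # sum_j A_ij p_{j,2}
        new_p1.append(p1[i] + dt * (-delta1 * p1[i] + beta1 * s_i * pressure1))
        new_p2.append(p2[i] + dt * (-delta2 * p2[i] + beta2 * s_i * pressure2))
    return new_p1, new_p2


def integrate(neighbors, p1, p2, beta1, delta1, beta2, delta2,
              dt: float = 0.01, t_end: float = 60.0) -> Tuple[List[float], List[float]]:
    """Repeatedly apply Euler steps until time t_end."""
    for _ in range(int(round(t_end / dt))):
        p1, p2 = mean_field_step(neighbors, p1, p2, beta1, delta1, beta2, delta2, dt)
    return p1, p2


# Hypothetical test graph: a cycle of 20 nodes (2-regular, so lambda = 2).
n = 20
cycle = [[(i - 1) % n, (i + 1) % n] for i in range(n)]
beta1, delta1 = 3.0, 1.0  # sigma_1 = 2 * 3 / 1 = 6
beta2, delta2 = 2.0, 1.0  # sigma_2 = 2 * 2 / 1 = 4
p1_final, p2_final = integrate(cycle, [0.01] * n, [0.05] * n,
                               beta1, delta1, beta2, delta2)
```

In this run the weaker virus decays towards zero while every $p_{i,1}$ settles at $1 - \delta_1/(2\beta_1) = 5/6$, the value virus 1 would reach on this graph in isolation.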

Fixed Points: At the steady state, i.e. at fixed points where the change in probabilities will be zero, we get (for each node $i$):

$$\delta_1 p_{i,1} = \beta_1 (1 - p_{i,1} - p_{i,2}) \sum_j A_{ij} p_{j,1} \qquad (4.5)$$

$$\delta_2 p_{i,2} = \beta_2 (1 - p_{i,1} - p_{i,2}) \sum_j A_{ij} p_{j,2} \qquad (4.6)$$

which can be written in vector form as:

$$\beta_1 S A P_1 = \delta_1 P_1 \qquad (4.7)$$

$$\beta_2 S A P_2 = \delta_2 P_2 \qquad (4.8)$$

where $P_1 = [p_{1,1}, p_{2,1}, \ldots, p_{N,1}]^T$, $P_2 = [p_{1,2}, p_{2,2}, \ldots, p_{N,2}]^T$ and $S = \mathrm{diag}(s_i) = I - \mathrm{diag}(P_1 + P_2)$.

In  all  of  the  following  analysis,  we  assume  we  are  operating  at  fixed  point  unless 
stated  otherwise,  i.e.  Equations  4.5  and  4.6  or  equivalently  Equations  4.7  and  4.8  hold. 
Additionally,  we  assume  that  A  is  connected.  First  we  have  the  following  series  of 
lemmas. 

Lemma 4.1. $\forall i$ we have that $s_i \neq 0$.

Proof. If $s_i = 0$ for any $i$, then Equations 4.5 and 4.6 immediately imply that $p_{i,1} = p_{i,2} = 0$, which contradicts $s_i + p_{i,1} + p_{i,2} = 1$. □

Lemma 4.2. If $\exists i$ such that $p_{i,1} = 0$, then $\forall i$, $p_{i,1} = 0$. Similarly, if $\exists i$ such that $p_{i,2} = 0$, then $\forall i$, $p_{i,2} = 0$.

Proof. If $\exists i$ such that $p_{i,1} = 0$, then from Equation 4.5 we have $\sum_j A_{ij} p_{j,1} = 0$ (as $s_i \neq 0$ from Lemma 4.1). Clearly, the $A_{ij}$'s are positive only for those nodes $j$ which are neighbors of node $i$, i.e. for $j \in NE(i)$ (and there is at least one such $j$ as the graph is connected). For these $j$, as the $p_{j,1}$ cannot be negative (they are probabilities), they have to be zero so that the above is true. Now we can apply the same argument we applied for node $i$ in turn to all the neighbors $j \in NE(i)$, and so on. Finally we get that $\forall i$, $p_{i,1} = 0$, as the graph is connected. We can prove the claim similarly for $p_{i,2}$. □

Lemma  4.3.  The  matrix  SA  is  non-negative  and  irreducible. 

Proof. $A$ is symmetric and irreducible as the graph is connected. From Lemma 4.1 we have that $S$ is a diagonal positive matrix. Clearly, it follows that $SA$ may be asymmetric but it is a non-negative and irreducible matrix (intuitively, multiplying by $S$ preserves the original edges in $A$). □

Lemma 4.4. The matrix $SA$ has a unique positive real number (say $\lambda$) as its largest eigenvalue (in magnitude). Further, the algebraic multiplicity of $\lambda$ is 1 and it has a positive eigenvector (say $v$; then all components of $v$ are positive).

Proof. As $SA$ is non-negative and irreducible (Lemma 4.3), we can apply the Perron-Frobenius theorem [McC00]. The lemma follows directly then. □

Lemma 4.5. There are no positive eigenvectors of $SA$ other than $v$ (the Perron eigenvector of $SA$ corresponding to the largest eigenvalue).

Proof. From Lemma 4.3, it follows that $(SA)^T$ is non-negative and irreducible as well. Moreover, note that the eigenvalues of any matrix $M$ and $M^T$ are the same. Hence, again applying the Perron-Frobenius theorem to $(SA)^T$, we have the largest eigenvalue as $\lambda$ and the corresponding positive eigenvector as, say, $u$. From the eigenvalue equation, we know that $u^T SA = \lambda u^T$.

Now suppose we have another positive eigenvector, say $w$, corresponding to eigenvalue $\tau$ of $SA$ (so, $SAw = \tau w$). Then:

$$\lambda u^T w = u^T S A w = \tau u^T w$$

Hence $(\lambda - \tau) u^T w = 0$. But $u^T w \neq 0$ as both $u$ and $w$ are positive. Hence $\tau = \lambda$. But $\lambda$ has multiplicity 1 (Lemma 4.4) and hence $w = v$ (up to a positive scaling factor). □

Lemma 4.6. At a fixed point, $P_1 > 0$ and $P_2 > 0$ cannot both hold unless $\frac{\delta_1}{\beta_1} = \frac{\delta_2}{\beta_2}$.

Proof. Together with Lemma 4.2, Equation 4.7 implies that either $P_1 = 0$ or it is a positive eigenvector of $SA$ with eigenvalue $\delta_1/\beta_1$. Similarly, from Equation 4.8 (and Lemma 4.2) we get that either $P_2 = 0$ or it is a positive eigenvector of $SA$ with eigenvalue $\delta_2/\beta_2$. From Lemma 4.5, the only positive eigenvector of $SA$ is the one corresponding to the largest eigenvalue. Hence both $P_1 > 0$ and $P_2 > 0$ can hold only if $\frac{\delta_1}{\beta_1} = \frac{\delta_2}{\beta_2}$. Otherwise at least one of them is zero. □

Assuming  the  virus  strengths  are  not  equal,  Lemma  4.6  implies  the  following  theorem: 

Theorem 4.2. Assuming the virus strengths are not equal, the system has only the following possible fixed points:

1. $\{P_1 \to 0,\ P_2 \to 0\}$ (i.e. the viruses die out)

2. $\{P_1 \to$ Perron eigenvector of $SA$, $P_2 \to 0\}$ (i.e. only virus 1 survives)

3. $\{P_2 \to$ Perron eigenvector of $SA$, $P_1 \to 0\}$ (i.e. only virus 2 survives)
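The second fixed point can be illustrated numerically: with $P_2 = 0$, the surviving vector $P_1$ should satisfy Equation 4.7, i.e. it should be a positive eigenvector of $SA$ with eigenvalue $\delta_1/\beta_1$. The sketch below is not part of the proof and not code from this thesis. It assumes a hypothetical 5-node path graph with $\beta_1 = 2$, $\delta_1 = 1$, and finds the fixed point by repeatedly solving Equation 4.5 for $p_{i,1}$ (a simple fixed-point iteration chosen for this illustration).

```python
from typing import List


def single_virus_fixed_point(neighbors, beta, delta, iterations: int = 500) -> List[float]:
    """Iterate p_i <- beta*x_i / (delta + beta*x_i), with x_i = sum_j A_ij p_j, from p = 1.

    This is Equation 4.5 with P2 = 0, solved for p_i on the left-hand side.
    """
    p = [1.0] * len(neighbors)
    for _ in range(iterations):
        x = [sum(p[j] for j in nbrs) for nbrs in neighbors]
        p = [beta * xi / (delta + beta * xi) for xi in x]
    return p


def eigen_residual(neighbors, p, beta, delta) -> float:
    """Largest |beta * s_i * (A p)_i - delta * p_i| over all nodes (Equation 4.7, P2 = 0)."""
    worst = 0.0
    for i, nbrs in enumerate(neighbors):
        s_i = 1.0 - p[i]
        ap_i = sum(p[j] for j in nbrs)
        worst = max(worst, abs(beta * s_i * ap_i - delta * p[i]))
    return worst


# Hypothetical example: a path graph on 5 nodes (not a regular graph).
path = [[1], [0, 2], [1, 3], [2, 4], [3]]
beta1, delta1 = 2.0, 1.0
P1 = single_virus_fixed_point(path, beta1, delta1)
residual = eigen_residual(path, P1, beta1, delta1)
```

The residual is numerically zero and all entries of `P1` are positive, so `P1` behaves as a positive eigenvector of $SA$ with eigenvalue $\delta_1/\beta_1 = 0.5$ in this example.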

We  can  assert  the  next  lemma  immediately: 

Lemma 4.7. The second and third fixed points in Theorem 4.2 require $\sigma_1 > 1$ and $\sigma_2 > 1$ respectively.

Proof. In the second fixed point, virus 2 dies out and only virus 1 survives. Hence the system now is equivalent to a single virus operating on the whole graph under the standard flu-like SIS model. For this we already know that the virus should be above the 'epidemic threshold' if it has to survive (and not die out exponentially quickly) [CWW+08, PCF+11]. Hence $\lambda\beta_1/\delta_1 = \sigma_1 > 1$ is necessary for the second fixed point. Similarly we can prove the case for when virus 2 survives. □


Stability  Conditions:  We  first  compute  the  Jacobian  at  each  of  the  fixed  points. 

Lemma 4.8. The Jacobians at the three fixed points can be written as below (each Jacobian is a $2N \times 2N$ matrix; each sub-matrix block below is a matrix of size $N \times N$).

1. $J_1 = \begin{bmatrix} \beta_1 A - \delta_1 I & 0 \\ 0 & \beta_2 A - \delta_2 I \end{bmatrix}$

2. $J_2 = \begin{bmatrix} \beta_1 S A - \delta_1 I - \beta_1 \mathrm{diag}(A P_1) & -\beta_1 \mathrm{diag}(A P_1) \\ 0 & \beta_2 S A - \delta_2 I \end{bmatrix}$

3. $J_3 = \begin{bmatrix} \beta_1 S A - \delta_1 I & 0 \\ -\beta_2 \mathrm{diag}(A P_2) & \beta_2 S A - \delta_2 I - \beta_2 \mathrm{diag}(A P_2) \end{bmatrix}$

Here, in $J_2$, $P_1$ is the Perron eigenvector of $SA$ with eigenvalue $\delta_1/\beta_1$ (i.e. it satisfies Equation 4.7 and is non-zero). Similarly for $P_2$ in $J_3$.


Proof.  Can  be  computed  using  standard  differentiation.  Details  omitted  for  brevity.  □ 

Given the discussion before, we can analyze the corresponding conditions for each fixed point to be a hyperbolic stable attractor.

Lemma 4.9. The conditions for the fixed points to be hyperbolic and stable attractors are:

1. $\sigma_1 < 1$ and $\sigma_2 < 1$

2. $\sigma_1 > \sigma_2$

3. $\sigma_2 > \sigma_1$


Proof.  We  prove  the  conditions  for  each  fixed  point  separately  below  (we  omit  some 
details  for  brevity): 

1. The eigenvalues of matrix $J_1$ are simply the eigenvalues of the matrices $M_1 = \beta_1 A - \delta_1 I$ and $M_2 = \beta_2 A - \delta_2 I$. All the eigenvalues of these matrices will be negative if the largest eigenvalue is negative (as $M_1$ and $M_2$ are real and symmetric, all their eigenvalues are real). Hence the conditions for this are $\beta_1\lambda/\delta_1 < 1$ and $\beta_2\lambda/\delta_2 < 1$, where $\lambda$ is the largest eigenvalue of $A$.

2. We can see that the eigenvalues of the matrix $J_2$ are either the eigenvalues of the matrix $M_1 = \beta_1 S A - \delta_1 I - \beta_1 \mathrm{diag}(A P_1)$ or the eigenvalues of the matrix $M_2 = \beta_2 S A - \delta_2 I$. The eigenvalues of $M_2$ are just the eigenvalues of $\beta_2 S A$ shifted down by $\delta_2$. From Lemma 4.4 and Equation 4.7 we know that under this fixed point, the largest eigenvalue of $SA$ is $\delta_1/\beta_1$. This implies that $\Re(\lambda_{SA}) \le \delta_1/\beta_1$ for any eigenvalue $\lambda_{SA}$ of $SA$$^4$. Thus,

$$\Re(\lambda_{M_2}) = \beta_2 \Re(\lambda_{SA}) - \delta_2 \le \beta_2\delta_1/\beta_1 - \delta_2 < 0$$

where the last step follows if $\beta_2/\delta_2 < \beta_1/\delta_1$. Hence, if $\sigma_2 < \sigma_1$, the real parts of all the eigenvalues of $M_2$ are negative.

Consider the matrix $D = M_1 + M_1^T = \beta_1 (SA + AS) - 2\delta_1 I - 2\beta_1 \mathrm{diag}(A P_1)$. Matrix $D$ is clearly real and symmetric and so has all real eigenvalues. Due to Lemma 4.4 we can apply the Perron-Frobenius theorem to $SA + AS$ as well and deduce that its largest eigenvalue $\lambda_1(SA + AS)$ is positive. Further, from matrix theory [HJ91, HW97], we know that for any real non-negative matrix $C$, $\lambda_1(C + C^T) \le 2\lambda_1(C)$. Hence $0 < \lambda_1(SA + AS) \le 2\lambda_1(SA) = 2\delta_1/\beta_1$. Again we know from standard linear algebra [HJ91] that $\lambda_1(X + Y) \le \lambda_1(X) + \lambda_1(Y)$ if $X$ and $Y$ are symmetric. Hence,

$$\lambda_1(D) \le \beta_1 \lambda_1(SA + AS) - 2\delta_1 - 2\beta_1 \lambda_1(\mathrm{diag}(A P_1))$$
$$\le 2\beta_1\delta_1/\beta_1 - 2\delta_1 - 2\beta_1 \lambda_1(\mathrm{diag}(A P_1))$$
$$= -2\beta_1 \lambda_1(\mathrm{diag}(A P_1))$$
$$< 0$$

as under this fixed point, $\mathrm{diag}(A P_1)$ is a diagonal matrix with positive entries and hence has all positive eigenvalues. As $D$ has all real eigenvalues, $\lambda_1(D) < 0$ implies that it has all negative eigenvalues. The Lyapunov theorem [HS74] states that a matrix $C$ is stable (has $\Re(\lambda_C) < 0$) if $C^T + C$ has all negative eigenvalues. Applying it to our case, we can see that matrix $D$ having all negative eigenvalues implies that $M_1$ is stable unconditionally under this fixed point.

Finally, as $M_1$ and $M_2$ (and so $J_2$ as well) both have eigenvalues with negative real parts when $\sigma_2 < \sigma_1$, the fixed point is a hyperbolic stable attractor if $\sigma_2 < \sigma_1$.

$^4$ $\Re(x)$ denotes the real part of $x$.

3.  Analogous  to  the  case  of  the  fixed  point  above. 

Proved.  □ 

Lemma  4.9  combined  with  Lemma  4.7  allows  us  to  conclude  the  following: 

Theorem 4.3. The corresponding conditions for each of the fixed points to (a) exist, and (b) have stability (i.e. be a hyperbolic and stable attractor) are:

1. $\sigma_1 < 1$ and $\sigma_2 < 1$
(i.e. both are below threshold)

2. $\sigma_1 > 1$ and $\sigma_1 > \sigma_2$
(i.e. virus 1 is above threshold and virus 1 strength is greater than virus 2)

3. $\sigma_2 > 1$ and $\sigma_2 > \sigma_1$
(i.e. virus 2 is above threshold and virus 2 strength is greater than virus 1)

Combining Theorem 4.2 and Theorem 4.3, we again have a result similar to the single-virus epidemic threshold: viruses die out if they are below the individual epidemic threshold (i.e. if $\beta\lambda/\delta < 1$). Finally, they also imply our WTA result (Theorem 4.1).
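The conditions of Theorem 4.3 translate directly into a small decision rule. The following Python sketch is only an illustration of how the theorem is read; the function name `predict_outcome` and the returned labels are hypothetical, and boundary cases that the theorem does not cover (for example equal strengths) are reported as such rather than guessed.

```python
def predict_outcome(lam: float, beta1: float, delta1: float,
                    beta2: float, delta2: float) -> str:
    """Steady-state prediction from Theorem 4.3, given the largest eigenvalue `lam` of A."""
    sigma1 = lam * beta1 / delta1  # strength of virus 1
    sigma2 = lam * beta2 / delta2  # strength of virus 2
    if sigma1 < 1 and sigma2 < 1:
        return "both die out"
    if sigma1 > 1 and sigma1 > sigma2:
        return "virus 1 wins"
    if sigma2 > 1 and sigma2 > sigma1:
        return "virus 2 wins"
    return "not covered"  # e.g. equal strengths (non-hyperbolic case)


# Hypothetical parameter values on a graph with lam = 2.
below = predict_outcome(2.0, 0.3, 1.0, 0.1, 1.0)  # sigma_1 = 0.6, sigma_2 = 0.2
mixed = predict_outcome(2.0, 3.0, 1.0, 0.1, 1.0)  # sigma_1 = 6,   sigma_2 = 0.2
above = predict_outcome(2.0, 3.0, 1.0, 2.0, 1.0)  # sigma_1 = 6,   sigma_2 = 4
```

Only the product $\lambda\beta/\delta$ of each virus enters the rule, which is why the same scenarios can be reproduced on any graph by rescaling $\beta$ or $\delta$.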

4.5  Experiments 

We  demonstrate  our  result  using  (a)  simulation  experiments  on  varied  datasets;  and  (b) 
case  studies  using  real  data  in  this  section. 
4.5.1 Setup

We  first  briefly  describe  our  experimental  setup  for  the  simulations  as  well  as  the  case 
studies. 

Simulations:  WLOG,  in  our  experiments,  we  assumed  that  the  first  virus  is  the  stronger 
virus.  We  then  considered  the  following  three  cases: 

BELOW : $1 > \beta_1\lambda/\delta_1 = 0.6 > \beta_2\lambda/\delta_2 = 0.2$
(both viruses below the threshold)

MIXED : $\beta_1\lambda/\delta_1 = 6 > 1 > \beta_2\lambda/\delta_2 = 0.2$
(one above and one below the threshold)

ABOVE : $\beta_1\lambda/\delta_1 = 6 > \beta_2\lambda/\delta_2 = 4 > 1$
(both above the threshold)

We  used  the  following  real-world  and  synthetic  network  datasets  for  the  simulations: 

1. AS-OREGON: The Oregon AS router graph, which is a network graph collected from the Oregon router views. It contains 15,420 links among 3,995 AS peers. More information can be found at http://topology.eecs.umich.edu/data.html.

2. PORTLAND: One of the biggest available physical contact graphs, representing a synthetic population of the city of Portland, Oregon, USA [NDS07]; it has been used in smallpox modeling studies [EGAK+04]. It is a social-contact graph containing more than 31 million links (interactions) among about 1.6 million nodes (people).

3.  Clique:  A  fully  connected  clique  of  1000  nodes. 

4. Barbell: Two cliques of 500 nodes joined together by weak edges of weight $\epsilon = 0.01$ (see Section 4.4.3 for a description).

We implemented our competing viruses model $SI_1I_2S$ as an event-based discrete simulation in C++. We randomly infect 30 nodes for each of the viruses at the start of any simulation. All simulations were run over 1000 time steps and the plots show averaged results from 100 runs.
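For readers who want to reproduce the flavor of these experiments, the following is a minimal Python sketch of an event-driven simulation of the $SI_1I_2S$ model. It is not the C++ simulator used in this thesis: it uses Gillespie's direct method (an assumed choice), recomputes all event rates after every event for clarity rather than speed, treats `t_max` as continuous time, and all names and the small test graph are hypothetical.

```python
import random
from typing import List, Optional, Tuple


def simulate(neighbors: List[List[int]], beta1: float, delta1: float,
             beta2: float, delta2: float, seeds: int = 30, t_max: float = 1000.0,
             rng: Optional[random.Random] = None) -> Tuple[int, int]:
    """Run one SI1I2S simulation and return the final (#I1, #I2) counts."""
    rng = rng if rng is not None else random.Random(0)
    n = len(neighbors)
    state = ["S"] * n
    chosen = rng.sample(range(n), 2 * seeds)  # disjoint seed sets for the two viruses
    for v in chosen[:seeds]:
        state[v] = "I1"
    for v in chosen[seeds:]:
        state[v] = "I2"

    t = 0.0
    while True:
        # Enumerate every possible event as (rate, node, new_state).
        events = []
        for v in range(n):
            if state[v] == "I1":
                events.append((delta1, v, "S"))
            elif state[v] == "I2":
                events.append((delta2, v, "S"))
            else:
                n1 = sum(1 for u in neighbors[v] if state[u] == "I1")
                n2 = sum(1 for u in neighbors[v] if state[u] == "I2")
                if n1:
                    events.append((beta1 * n1, v, "I1"))
                if n2:
                    events.append((beta2 * n2, v, "I2"))
        total = sum(rate for rate, _, _ in events)
        if total <= 0.0:
            break  # no infected nodes left
        t += rng.expovariate(total)  # waiting time to the next event
        if t > t_max:
            break
        # Pick one event with probability proportional to its rate.
        pick = rng.random() * total
        acc = 0.0
        node, new_state = events[-1][1], events[-1][2]  # fallback for rounding
        for rate, v, target in events:
            acc += rate
            if pick < acc:
                node, new_state = v, target
                break
        state[node] = new_state
    return state.count("I1"), state.count("I2")


# Hypothetical small example: a clique of 40 nodes (largest eigenvalue 39).
clique = [[j for j in range(40) if j != i] for i in range(40)]
below_run = simulate(clique, 0.6 / 39, 1.0, 0.2 / 39, 1.0, seeds=5)
above_run = simulate(clique, 6.0 / 39, 1.0, 4.0 / 39, 1.0, seeds=5, t_max=5.0)
```

Averaging the counts over many independent runs (for instance with different `rng` seeds) would mirror the averaging over 100 runs described above.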

Case-studies: We collected historical data for 'web-search interest' for various competing products from the Google-Insights website$^5$, which aims to 'provide insights into broad search patterns'. This allows us to use the data as a proxy for product sales/adoption for each product. We used the following pairs of rival products:

1. Reddit and Digg: Two social news websites, where users post links to interesting memes/news articles.$^6$

2. Facebook and Myspace: Two social network websites, where users add their friends and share posts, pictures etc.$^7$

3. Blu-ray and HD-DVD: Two rival standards of high-density optical media.

$^5$ www.google.com/insights/search/

$^6$ www.reddit.com, www.digg.com

The full mutual immunity model doesn't describe all the above situations perfectly, but it is a very good approximation. We understand that not all of the pairs are mutually exclusive in the strict sense: for example, people can post links on both Digg and Reddit. However, people are unlikely to be part of both communities, as they have to choose a site while sharing content.

4.5.2  Simulation  Results 


[Figure 4.4 panels: (a) BELOW (Time-plot), (b) MIXED (Time-plot), (c) ABOVE (Time-plot), (d) BELOW (Phase-plot), (e) MIXED (Phase-plot), (f) ABOVE (Phase-plot)]

Figure 4.4: (a-c) Number of infected vs time plots for simulations on the AS-OREGON network for different scenarios. (d-f) Corresponding phase plots (scatter plot of the number of nodes infected by virus 1 (y-axis) and the number of nodes infected by virus 2 (x-axis)). Stable fixed points are marked by bold circles, unstable by hollow circles. Clearly, the stronger virus wins (as long as it is above threshold) and the weaker dies out completely, as our result predicted.


Figures  4.4  and  4.5  demonstrate  our  results.  In  short,  the  plots  agree  exactly  with  our 
result,  as  expected. 

Figure 4.4 shows the time-plots and phase-plots for simulations on the AS-OREGON graph for our three scenarios as discussed before in the setup. The time-plots show the number of nodes infected vs time for each of the viruses (red for the weaker virus, green for the stronger one). The phase plot is the scatter plot of the number of nodes infected by the stronger virus on the y-axis and the number of nodes infected by the weaker virus on the x-axis. Thus a phase plot shows the trajectory of the simulation in the 2-d plane. The stable points in each scenario are marked with solid circles.

$^7$ www.facebook.com, www.myspace.com

Figure 4.5: Phase plots (scatter plot of the number of nodes infected by virus 2 and the number of nodes infected by virus 1) for simulations on (a) the PORTLAND network, (b) a clique and (c) a barbell graph for scenario ABOVE. Again, stable fixed points are marked by bold circles and unstable fixed points by hollow circles (FP3 not shown in (a) for the sake of clarity of the trajectory). The weaker virus tries to dominate (note the bulge), but it dies out completely and the stronger virus wins, as our result predicted.

In the BELOW case, we expect that both of them die out. This is borne out by both the time and phase plots (Figures 4.4(a) and (d)). Point FP1 (0, 0) is the only stable fixed point in this case and hence the system converges to it very quickly (see the phase plot). On the other hand, when the stronger virus is above threshold (MIXED) we can see that it takes over and the other virus dies out (Figures 4.4(b) and (e)). In this case, point FP2 is stable and attracting while FP1 becomes unstable. As a result, we converge to the steady state where only the stronger virus survives. Finally, in case ABOVE, when each could have dominated in isolation, the stronger virus clearly wins and wipes out the weaker virus (Figures 4.4(c) and (f)). Here, FP2 is again stable while the other fixed points are unstable. Moreover, note that the stronger virus reaches the same steady state as in MIXED. This agrees with our analysis as well (see Lemma 4.7): in both scenarios, the stronger virus will reach the same fixed point as it would have if operating in isolation, without the presence of a competitor.

Similarly,  Figure  4.5  shows  the  phase-plots  for  simulations  on  the  other  graph  datasets 
-  PORTLAND,  Clique  and  Barbell.  For  lack  of  space,  we  just  show  the  plots  for  case  ABOVE. 
As  before,  the  stronger  virus  wins  and  the  weaker  virus  dies-out  completely,  no  matter 
the  network,  in  perfect  agreement  with  our  result  (Theorem  4.1). 
[Figure 4.6, panels (d)-(f): phase plots for (d) Reddit vs Digg, (e) Facebook vs Myspace, (f) Blu-ray vs HD-DVD]


Figure 4.6: (a-c) Real web-search interest vs time plots for pairs of competitors (see Section 4.5.1
for more details). (d-f) Corresponding phase plots. As expected from our WTA result, note that
the stronger rival dominates and the weaker product almost dies out.


4.5.3  Case-Studies  using  Real  Data 


Figure 4.6 shows the historical data we collected from Google-Insights. In short, the data
corroborate the WTA phenomenon in the real world as well.

Figures  4.6(a-c)  show  the  web-search  interest  vs  time  for  the  three  pairs  of  competitors 
we  discussed  before  in  the  setup.  Figures  4.6(d-f)  show  the  corresponding  phase-plots 
(the final data-point is marked by a diamond). First, because this is real data, the curves
deviate noticeably from the smooth steady states produced by
our models (e.g., the spikes in Figure 4.6(c) denote Christmas shopping sales).
Nevertheless,  they  broadly  give  positive  evidence  for  the  WTA  result  e.g.,  in  (a-b)  and 
(d-e),  even  though  Digg  and  MySpace  had  a  head-start  and  even  dominate  for  a  while, 
the  stronger  product  (Reddit  and  Facebook)  eventually  takes-over.  The  phase  plots 
also  show  the  trajectories  in  effect  similar  to  the  ones  found  in  our  simulations.  Clearly, 
in  all  the  plots  we  can  see  that  the  eventual  winner  and  dominant  competitor  (Reddit, 
Facebook,  Blu-ray)  almost  completely  wipes-out  the  weaker  competitor,  just  as  our  result 
predicts. 
4.6 Discussion


There  are  several  subtle  points,  that  we  deferred  until  now,  for  clarity  of  exposition. 
Specifically,  here  we  discuss  the  following  issues: 

Question:  Explain  the  counter-examples,  of  'winner  takes  all'.  If  'winner  takes  all',  how 
come  there  are  competing  products  where  the  weaker  one  still  has  a  non-trivial  market 
share,  like  'Windows',  'Mac-os'  (and  'Linux')? 

Answer:  Not  'level-fields';  or  not  enough  time.  There  are  indeed  numerous  cases  where 
two  (or  more)  competing  products  or  ideas,  co-exist.  For  example,  in  the  OS  'wars', 
MS-windows  has  a  large  market  share,  with  mac-OS  having  a  smaller,  but  near-constant 
market  share.  There  are  several  settings  that  could  cause  such  deviations. 

• One is that we are violating our assumption of 'fair-play', e.g., some nodes (like
enthusiastic Apple™ fans) exhibit much lower infection probability $\beta$, or even
zero, for one of the viruses. Thus by catering to just that niche where it is much
stronger, the competitor can survive.

•  A  second  cause  is  weak  connectivity,  like  a  bar-bell  graph  with  a  narrow  bridge, 
and  not  enough  time  to  reach  steady-state. 

•  A  third  cause  is  viruses  of  near-equal  strength.  We  omit  the  simulation  results 
here,  but  similar-strength  viruses  take  too  long  to  reach  WTA.  This  is  analogous  to 
the  case  of  two  near-equal-strength  tennis  players,  that  need  several  games,  and 
several  tie-breakers,  before  a  winner  emerges. 

Question:  Has  this  WTA  phenomenon  appeared  in  other  settings? 

Answer: Yes, with simulation results. In epidemiology studies, WTA is referred to as
'competitive exclusion', e.g., see [CCHL96, CCHL99, AA05, AM82]. However, they typically
did simulations, or they only studied homogeneous or very structured topologies like
cliques.

Question: How about other propagation models (SIR etc.)? Will WTA hold, then?

Answer: We conjecture that the answer is 'yes'. The full analysis for SIR (= life-long
immunity, like mumps), SIRS (= long, but not permanent, immunity) and more is the
focus of our ongoing research. We conjecture that similar results may hold, too, extrapolating
from the results of Prakash et al. [PCF+11]: that work showed that, for a single
virus, the epidemic threshold has the same form for almost any virus propagation model.

Question:  Will  WTA  hold,  under  partial  mutual  immunity? 

Answer: Future work - no conjectures. In this work, we assume full mutual exclusion,
that is, a given node will have at most one of the two viruses/products (iPhone/Android)
at any given point in time, but not both. There are marketing and biological settings
in which a person may have both products/viruses. Will WTA hold, then? This seems like
a difficult question, and is left for future work. We suspect that the answer will not be a
simple 'yes' or 'no'.
4.7 Conclusions


In summary, we tackled the setting of two competing products (or viruses or ideas etc.)
spreading over a network and studied the problem of what happens in the end (i.e., in
the steady state). In addition to the problem formulation and bringing ecological concepts to
web-like phenomena, the main contributions of our work are as follows:

1. WTA Result and Proof: We provided a theoretical analysis of the propagation model
for arbitrary graph topology, proving the 'winner-takes-all' result, i.e., the stronger
virus dominates and wipes out the weaker virus (if it is above threshold). See
Theorem 4.1.

2.  Experiments  and  Case-studies:  We  also  demonstrated  our  result  using  extensive 
simulations  on  real  and  synthetic  networks  showing  that  they  match  exactly  with 
our  predictions.  Moreover,  using  case-studies  of  historical  data  of  competing 
products  (Blu-ray/HD-DVD,  Facebook/MySpace,  Reddit/Digg),  we  provided 
positive  evidence  of  WTA  in  real-life. 
Chapter 5

Competing  Viruses:  Co-existence 


In the previous chapter, we studied competing viruses spreading over a network which
are mutually exclusive and under perfect competition, i.e., if a user buys product 'A' (or
gets infected with virus 'X'), she will never buy product 'B' (or get infected with virus 'Y'). This is not
always true: for example, a user could install and use both Firefox and Google Chrome
as browsers. Similarly, one type of flu may give partial immunity against some other
similar disease.

In the case of full competition, we proved that 'winner takes all', that is, the weaker
virus/product will become extinct. In the case of no competition, both viruses survive,
ignoring each other. So a natural question is: what happens in between these two
extremes?

In this chapter, we show that there is a phase transition: if the competition is harsher
than a critical level, then 'winner takes all'; otherwise, the weaker virus survives. Our
contributions are: (a) the problem definition, which is novel even in epidemiology
literature [AM91, Het00, Ste09], (b) the phase-transition result and (c) experiments on
real data, illustrating the suitability of our results.


5.1  Introduction 

Given  two  partially  competing  products  (like  Firefox  and  Google  Chrome;  or  Android 
and  iPhone),  is  it  possible  that  they  both  survive? 

The  well-known  Competitive  Exclusion  Principle  in  ecology  states  that  when  two 
species  are  in  complete  competition  under  constant  conditions,  the  more  fit  one  will 
eventually  drive  the  less  fit  one  into  extinction.  A  more  common  but  less  well  understood 
scenario  is  one  where  the  competing  species  induce  partial  immunity  against  one  another. 
There  has  been  significant  work  trying  to  elucidate  the  conditions  under  which  such 
partial  immunity  leads  to  coexistence  [LCC+09,  CCF 1 10,  LSN96]  but  a  complete  theory 
has  not  yet  emerged. 

Here,  we  study  the  general  case  of  two  virus  strains  with  partial  (and  symmetric) 
cross-immunity  spreading  over  a  fixed  network  topology.  In  addition  to  the  implications 
[Figure 5.1, panels: (a) Hulu and Blockbuster; (b) Firefox and Chrome]

Figure  5.1:  Plots  of  real  web-search  interest  vs.  time  for  pairs  of  competitors  with  our  model 
fitted  to  the  data. 


for  the  evolutionary  problem  discussed  above,  our  results  have  direct  relevance  to  the 
spread  of  rumors  and  opinions  in  social  networks  and  market  penetration  of  products. 

The  contributions  of  this  work  are  the  following: 

• the discovery that there is a phase transition, that is, the weaker virus/product may
survive, if the cross-immunity satisfies a threshold condition. This seemed to be an
open problem even within the epidemiology community [LCC+09]

•  experiments  on  real  data,  showing  that  our  model  fits  well 

Figure 5.1 shows the time-plots for partially competing products Hulu vs. Blockbuster
(a), and Google Chrome vs. Firefox (b). Each plots the (normalized) count of Google queries
versus time. We fit our model to the data¹ and plot it as well. Notice that it captures the
trends well.

The rest of the chapter is organized in the usual way: we review related work in § 5.2
and formulate the problem giving details of our model in § 5.3. We give the analysis and
proof of our phase-transition and coexistence result in § 5.4 and demonstrate the validity
of the results using simulations and real-world case-studies in § 5.5. Finally, we discuss
other subtle aspects of the model in § 5.6 and conclude in § 5.7.


5.2  Related  Work 

We  give  a  very  brief  summary  of  the  related  work.  For  much  of  the  prior  work  in  this 
area  please  see  the  related  work  in  the  previous  chapter.  As  mentioned  before,  most 
works  deal  with  single  viruses.  Our  previous  chapter  looked  at  a  two-virus  SIS  model  on 
arbitrary  graphs,  but  focused  on  the  case  where  there  was  full  mutual  immunity  between 
viruses.  The  main  result  says  that,  in  such  a  setting,  the  stronger  virus  will  push  the 
weaker  one  to  extinction  ('winner  takes  all'),  even  if  the  weaker  one  would  be  able  to 
survive  on  the  network  when  left  alone. 

¹ Fitted with www.alexbeutel.com/jsplot/kdd2012.html
Partial immunity models have received much attention in epidemiology. For example,
[LCC+09]  suggests  a  differential  equation  based  model  and  analyzes  it  via  simulation. 
However,  for  this  and  most  other  models  of  interest,  a  complete  analytical  solution  has 
been  beyond  reach. 

Distinguishing features of current work: In short, none of the previous work fulfills
all the conditions of this current work: (a) analytical proof of $\epsilon_{\mathrm{critical}}$, the critical value of
the competition threshold, (b) closed-form steady-state behavior, (c) under an SIS (flu-like)
model.


5.3  Problem  Formulation 

In  this  section,  we  formulate  our  problem,  giving  details  about  the  model  used  and  the 
assumptions.  Table  5.1  explains  the  terminology  we  have  used  in  the  chapter.  Bold  letters 
typically  denote  matrices  (A,  M  etc.). 


Table  5.1:  Symbols  and  Definitions 


Symbol | Definition and Description
$SI_{1|2}S$ | our competing viruses model
$\beta_1$ (or $\beta_2$) | attack rate of virus 1 (or virus 2)
$\delta_1$ (or $\delta_2$) | cure rate of virus 1 (or virus 2)
$\epsilon$ | interaction factor between virus 1 and virus 2
$\mathbf{A}$ | adjacency matrix of the underlying graph
$\lambda_1(\mathbf{M})$ | largest eigenvalue of matrix $\mathbf{M}$
$\lambda$ | $\lambda_1(\mathbf{A})$
$\sigma_1$ | $\lambda\beta_1/\delta_1$ (strength of virus 1)
$\sigma_2$ | $\lambda\beta_2/\delta_2$ (strength of virus 2)
$I_1$ (or $I_2$) | the number of nodes infected with only virus 1 (or only virus 2)
$I_{1,2}$ | the number of nodes infected with both virus 1 and virus 2
$\kappa_1$ | fraction of nodes infected with virus 1 ($(I_1 + I_{1,2})/N$)
$\kappa_2$ | fraction of nodes infected with virus 2 ($(I_2 + I_{1,2})/N$)
$i_{12}$ | fraction of nodes infected with both viruses ($I_{1,2}/N$)
$i_1$ (or $i_2$) | fraction of nodes infected with only virus 1: $I_1/N$ (or only virus 2: $I_2/N$)
$\kappa_1^*, \kappa_2^*, i_{12}^*$ | the solution at the coexistence equilibrium (if it exists)


5.3.1  The  propagation  model 

We assume that the competing viruses are spreading on the network according to a
propagation model, which we describe next. We call our propagation model $SI_{1|2}S$, based
on the popular "flu-like" SIS (Susceptible-Infected-Susceptible) model [Het00]. $SI_{1|2}S$
denotes Susceptible - Infected$_{1 \text{ or } 2}$ - Susceptible. Each node in the graph can be in one of
four states: Susceptible (healthy), $I_1$ (infected by virus 1), $I_2$ (infected by virus 2), or $I_{1,2}$
(infected by both virus 1 and virus 2). The state transition diagram as seen from a node
in the network is shown in Figure 5.2. We could have extended other single virus models
as well, but we believe that our model is a reasonable starting point, and we leave the
analysis of other models as future work.

Figure 5.2: State Diagram for a node in the graph under our partial-competition model.

Healing (virus death) rate: $\delta$. If a node is in an infected state ($I_1$, $I_2$, or $I_{1,2}$), it recovers
on its own with some rate, $\delta_1$ for virus 1 or $\delta_2$ for virus 2. The healing rate is inversely
related to the virus's strength: a high $\delta$ means that nodes that are infected heal quickly.
For example, a product that lasts a long time such that people using it rarely consider
alternatives would be modeled with a low $\delta$ value.

Attack (virus transmission) rate: $\beta$. An infected node can spread the virus to its
neighboring nodes, and the node susceptibility is captured by $\beta_1$ and $\beta_2$. Specifically, an
infected node transmits its infection to each of its healthy neighbors independently at rate
$\beta_1$ (or $\beta_2$). The more often an idea or product is shared with friends, frequently referred
to as being "viral," the higher the value for $\beta$.

Virus interaction factor: $\epsilon$. A node infected with one virus may be more or less
susceptible to being infected by the other virus, as determined by the factor $\epsilon$. The
transmission rate for a virus becomes $\epsilon\beta_1$ (or $\epsilon\beta_2$) when a node is already infected with
the other virus. Specifically, if a node is infected with virus 1, each of its neighbors infected
with virus 2 transmits virus 2 to it at rate $\epsilon\beta_2$; a node infected with virus 2 can only
be infected with virus 1 at a rate of $\epsilon\beta_1$ per neighbor carrying virus 1.

This is a novel generalization of the single-virus SIS model to a multiple-virus scenario.
The value of $\epsilon$ can describe many different virus interactions. If $\epsilon = 0$ then the viruses
are fully mutually immune, and $0 < \epsilon < 1$ suggests an amount of competition between
viruses.
Fair-play: We assume that the competitors are playing a 'fair game': All nodes in the
network have the same model parameters ($\beta$'s, $\delta$'s, $\epsilon$) for each of the viruses and behave
according to the state-diagram in Figure 5.2.
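
The per-node transitions described above can be sketched in code. This is a minimal illustration, not the thesis's implementation: the names `State` and `transition_rates`, and the convention that a neighbor carrying both viruses counts toward both `n1` and `n2`, are assumptions made here. The healing transitions out of the doubly-infected state (losing virus 1 leaves the node with virus 2 only, and vice versa) are inferred from the clique equations of Section 5.3.3.

```python
from enum import Enum


class State(Enum):
    S = "susceptible"
    I1 = "infected by virus 1 only"
    I2 = "infected by virus 2 only"
    I12 = "infected by both viruses"


def transition_rates(state, n1, n2, beta1, beta2, delta1, delta2, eps):
    """Return {next_state: rate} for a single node.

    n1 / n2 are the numbers of neighbors currently carrying virus 1 / virus 2
    (assumption: a neighbor in state I12 counts toward both).
    """
    if state is State.S:
        # A healthy node is attacked at the full rates beta1 and beta2.
        return {State.I1: beta1 * n1, State.I2: beta2 * n2}
    if state is State.I1:
        # Already carrying virus 1: virus 2 attacks at the reduced rate eps * beta2.
        return {State.S: delta1, State.I12: eps * beta2 * n2}
    if state is State.I2:
        return {State.S: delta2, State.I12: eps * beta1 * n1}
    # State.I12: healing from virus 1 leaves virus 2 only, and vice versa.
    return {State.I2: delta1, State.I1: delta2}
```

With `eps = 0` the transitions into `State.I12` have rate zero, which recovers the fully mutually exclusive setting of the previous chapter.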

5.3.2  Problem  Statement 

We  are  now  in  a  position  to  state  the  problem  formally.  We  assume  the  underlying 
network  is  connected  -  otherwise  we  just  have  separate  disconnected  problems. 

Interacting  viruses  problem 

Given: An undirected connected graph G, and the propagation model ($SI_{1|2}S$) parameters
($\beta_1$, $\delta_1$ for virus 1, $\beta_2$, $\delta_2$ for virus 2, and $\epsilon$)

Find: What are the possible fixed points for the system? In particular, for what values of
$\epsilon$ is there a fixed point for which both virus 1 and virus 2 survive?


5.3.3  Model  Formulation  for  a  Clique 


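For a clique, the following differential equations fully describe the transitions of the
system, seen in Figure 5.2. Here $I_1$, $I_2$, and $I_{12}$ are the number of nodes infected with
only virus 1, only virus 2, and both virus 1 and 2 respectively. $N$ is the total number of
nodes, and $S$ is the number of susceptible nodes ($S = N - I_1 - I_2 - I_{12}$).

\[ \frac{dI_1}{dt} = \beta_1 S (I_1 + I_{12}) + \delta_2 I_{12} - \delta_1 I_1 - \epsilon \beta_2 I_1 (I_2 + I_{12}) \tag{5.1} \]

\[ \frac{dI_2}{dt} = \beta_2 S (I_2 + I_{12}) + \delta_1 I_{12} - \delta_2 I_2 - \epsilon \beta_1 I_2 (I_1 + I_{12}) \tag{5.2} \]

\[ \frac{dI_{12}}{dt} = \epsilon \beta_1 I_2 (I_1 + I_{12}) + \epsilon \beta_2 I_1 (I_2 + I_{12}) - (\delta_1 + \delta_2) I_{12} \tag{5.3} \]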
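
As a rough numerical companion to the clique model, the sketch below evaluates the right-hand sides of (5.1)-(5.3) and advances them with a plain forward-Euler step. The function names, the `params` dictionary and the step size are illustrative assumptions, not part of the thesis.

```python
def si12s_clique_rhs(I1, I2, I12, N, beta1, beta2, delta1, delta2, eps):
    """Right-hand sides of (5.1)-(5.3) for a clique of N nodes."""
    S = N - I1 - I2 - I12  # susceptible nodes
    dI1 = (beta1 * S * (I1 + I12) + delta2 * I12
           - delta1 * I1 - eps * beta2 * I1 * (I2 + I12))
    dI2 = (beta2 * S * (I2 + I12) + delta1 * I12
           - delta2 * I2 - eps * beta1 * I2 * (I1 + I12))
    dI12 = (eps * beta1 * I2 * (I1 + I12) + eps * beta2 * I1 * (I2 + I12)
            - (delta1 + delta2) * I12)
    return dI1, dI2, dI12


def euler_integrate(state, params, dt=0.05, steps=4000):
    """Forward-Euler integration; dt and steps are illustrative choices."""
    I1, I2, I12 = state
    for _ in range(steps):
        d1, d2, d12 = si12s_clique_rhs(I1, I2, I12, **params)
        I1, I2, I12 = I1 + dt * d1, I2 + dt * d2, I12 + dt * d12
    return I1, I2, I12


# Illustrative parameters (assumed here): N*beta1/delta1 = 6 and N*beta2/delta2 = 4.
params = dict(N=1000, beta1=0.0006, beta2=0.0004,
              delta1=0.1, delta2=0.1, eps=0.4)
single_virus = euler_integrate((30, 0, 0), params)  # only virus 1 is seeded
```

When only one virus is seeded, the other stays at zero and the model reduces to single-virus SIS on a clique, so `single_virus[0]` settles at $N - \delta_1/\beta_1$. Seeding both viruses, e.g. `euler_integrate((30, 30, 0), params)`, gives a trajectory that can be compared against the time plots discussed in the experiments.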

5.4  Results  and  Proofs 

The goal of our analysis is to find the values of $\epsilon$ for which there is an equilibrium point
at which both virus 1 and virus 2 survive. We find that there is an $\epsilon_{\mathrm{critical}}$ such that if
$\epsilon > \epsilon_{\mathrm{critical}}$ then there exists an equilibrium point at which the viruses coexist.

5.4.1  Formulating  the  problem 

At an equilibrium point, all derivatives are zero. Thus, we can find a simple equation for
$I_{12}$:

\[ \epsilon(\beta_1 + \beta_2) I_1 I_2 = (\delta_1 + \delta_2 - \epsilon(\beta_1 I_2 + \beta_2 I_1)) I_{12} \]
Lemma 5.1. The number of people infected by both virus 1 and virus 2 will obey the following
equation:

\[ I_{12} = I_1 I_2 \epsilon (\beta_1 + \beta_2) / (\delta_1 + \delta_2 - \epsilon(\beta_1 I_2 + \beta_2 I_1)) \]

Proof. Trivial, given the above. □

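Thus we have the expected three equilibrium points

• $I_1 = I_2 = I_{12} = 0$

• $I_1 = I_{12} = 0$, $I_2 = N - \delta_2/\beta_2$

• $I_2 = I_{12} = 0$, $I_1 = N - \delta_1/\beta_1$

and possibly one for which $I_1, I_2 > 0$ and obeys the differential equations outlined:

\[ 0 = \beta_1 S (I_1 + I_{12}) + \delta_2 I_{12} - \delta_1 I_1 - \epsilon \beta_2 I_1 (I_2 + I_{12}) \tag{5.4} \]

\[ 0 = \beta_2 S (I_2 + I_{12}) + \delta_1 I_{12} - \delta_2 I_2 - \epsilon \beta_1 I_2 (I_1 + I_{12}) \tag{5.5} \]

\[ 0 = \epsilon \beta_1 I_2 (I_1 + I_{12}) + \epsilon \beta_2 I_1 (I_2 + I_{12}) - (\delta_1 + \delta_2) I_{12} \tag{5.6} \]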

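We rework these equations to be primarily in terms of $\kappa_1$, $\kappa_2$, $i_{12}$, where $\kappa_1 = (I_1 +
I_{12})/N$, $\kappa_2 = (I_2 + I_{12})/N$, $i_{12} = I_{12}/N$. As such, each of these terms represents a fraction of
the population that is infected. We first convert the constraints to

\[ N \kappa_1 \beta_1 [1 - \kappa_1 - (1 - \epsilon) i_2] = \delta_1 \kappa_1 \tag{5.7} \]

\[ N \kappa_2 \beta_2 [1 - \kappa_2 - (1 - \epsilon) i_1] = \delta_2 \kappa_2 \tag{5.8} \]

\[ \epsilon N (\beta_1 \kappa_1 i_2 + \beta_2 \kappa_2 i_1) = (\delta_1 + \delta_2) i_{12} \tag{5.9} \]

where $i_1 = I_1/N$ and $i_2 = I_2/N$.

Manipulating (5.9) to remove $i_1$ and $i_2$, we find

\[ \epsilon \kappa_1 \kappa_2 (\sigma_1 \delta_1 + \sigma_2 \delta_2) = i_{12} [\delta_1 + \delta_2 + \epsilon \sigma_1 \delta_1 \kappa_1 + \epsilon \sigma_2 \delta_2 \kappa_2] \tag{5.10} \]

Remember, because we are working with a clique the virus strengths are $\sigma_1 = N\beta_1/\delta_1$
and $\sigma_2 = N\beta_2/\delta_2$.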

5.4.2  Results 


From these constraints, we look for a lower bound on $\epsilon$, such that coexistence is possible
whenever the competition is weaker than this bound (that is, whenever $\epsilon$ exceeds it).

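Theorem 5.1 (Epsilon Threshold Theorem). Given a fully connected graph with the $SI_{1|2}S$
model parameters $\sigma_1 \ge \sigma_2$, an equilibrium point for which $\kappa_1, \kappa_2 > 0$ exists if $\epsilon > \epsilon_{\mathrm{critical}}$, where

\[ \epsilon_{\mathrm{critical}} = \begin{cases} \dfrac{\sigma_1 - \sigma_2}{\sigma_2(\sigma_1 - 1)} & \text{if } \sigma_1 + \sigma_2 \ge 2 \\[2ex] \dfrac{2(1 + \sqrt{1 - \sigma_1\sigma_2})}{\sigma_1\sigma_2} & \text{if } \sigma_1 + \sigma_2 < 2 \end{cases} \tag{5.11} \]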
Proof. In Lemma 5.3 we give the possible fixed point for coexistence. In Lemma 5.4 we
show  the  constraints  for  the  fixed  points  to  be  real,  which  contribute  to  the  bounds  in 
(5.11).  In  Lemmas  5.5  through  5.9  we  give  the  proofs  for  the  constraints  on  the  fixed 
points  being  positive,  and  in  Lemma  5.10  we  give  the  proof  that  the  fixed  points  are  less 
than  one.  □ 
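
For concreteness, the threshold in (5.11) can be evaluated directly. The sketch below is illustrative only: the function name `epsilon_critical`, the argument swap that enforces $\sigma_1 \ge \sigma_2$, and the error raised at the degenerate point $\sigma_1 = \sigma_2 = 1$ are choices made here, not taken from the thesis.

```python
import math


def epsilon_critical(sigma1, sigma2):
    """Evaluate the threshold of (5.11) for positive virus strengths."""
    if sigma1 <= 0 or sigma2 <= 0:
        raise ValueError("virus strengths must be positive")
    if sigma1 < sigma2:
        # The theorem assumes sigma1 >= sigma2; relabel the viruses if needed.
        sigma1, sigma2 = sigma2, sigma1
    if sigma1 + sigma2 >= 2:
        if sigma1 == 1:
            # Only reachable when sigma1 == sigma2 == 1, where the first
            # case of (5.11) is 0/0.
            raise ValueError("threshold undefined for sigma1 == sigma2 == 1")
        return (sigma1 - sigma2) / (sigma2 * (sigma1 - 1))
    # sigma1 + sigma2 < 2 implies sigma1 * sigma2 < 1, so the root is real.
    return 2 * (1 + math.sqrt(1 - sigma1 * sigma2)) / (sigma1 * sigma2)


eps_c = epsilon_critical(6, 4)  # strengths used in the simulations of Section 5.5
```

For $\sigma_1 = 6$ and $\sigma_2 = 4$ this gives $2/(4 \cdot 5) = 0.1$, and two viruses of equal strength above 1 give a threshold of 0.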

Next  we  describe  all  of  the  Lemmas,  which  contribute  to  the  proof. 

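Lemma 5.2. If a fourth equilibrium point exists, then it should satisfy the following equation:

\[ \epsilon(\kappa_2 - \kappa_1) = 1/\sigma_1 - 1/\sigma_2 \tag{5.12} \]

Proof. Since we are only looking for non-zero solutions for $\kappa_1$ and $\kappa_2$, we can cancel
them in (5.7) and (5.8).

\[ 1 - \kappa_1 - (1 - \epsilon) i_2 = 1/\sigma_1 \tag{5.13} \]

\[ 1 - \kappa_2 - (1 - \epsilon) i_1 = 1/\sigma_2 \tag{5.14} \]

Subtracting, and using $i_2 - i_1 = \kappa_2 - \kappa_1$, we get the lemma. □

Lemma 5.3 (Coexistence Lemma). If an equilibrium point exists for which both viruses
coexist in the network, $\kappa_1, \kappa_2 > 0$, it will be at:

\[ i_{12} = \epsilon \kappa_1 \kappa_2 \, \frac{\sigma_1\delta_1 + \sigma_2\delta_2}{\delta_1 + \delta_2 + \epsilon\sigma_1\delta_1\kappa_1 + \epsilon\sigma_2\delta_2\kappa_2} \tag{5.15} \]

\[ \kappa_1 = \kappa_2 + \frac{\sigma_1 - \sigma_2}{\epsilon\sigma_2\sigma_1} \tag{5.16} \]

\[ \kappa_2 = \frac{-2\epsilon\sigma_1\sigma_2 + \epsilon^2\sigma_1\sigma_2^2 \pm \epsilon\sigma_1^{1/2}\sigma_2^{3/2}\sqrt{4 - 4\epsilon + \epsilon^2\sigma_1\sigma_2}}{2\epsilon^2\sigma_1\sigma_2^2} \tag{5.17} \]

We will denote the solution to these three equations for fixed-points as $i_{12}^*$, $\kappa_1^*$, and $\kappa_2^*$ respectively.

Proof. Equation (5.15) is a simple rearrangement of equation (5.10), and equation (5.16)
is a rearrangement of equation (5.12). Plugging (5.15) and (5.16) into (5.13) allows us to
solve for $\kappa_2$ resulting in (5.17). □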
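
The step from (5.13), (5.15) and (5.16) to (5.17) is stated without intermediate algebra. One possible route is sketched below. It is a reconstruction offered for orientation rather than the thesis's own derivation, and the shorthand $u$ (the fraction of nodes carrying at least one virus) is introduced here only for illustration.

```latex
% Shorthand (an assumption of this sketch): u = fraction of nodes with any infection.
u = \kappa_1 + \kappa_2 - i_{12} = \kappa_1 + i_2 = \kappa_2 + i_1 .

% Multiply (5.13) by \sigma_1\delta_1 and (5.14) by \sigma_2\delta_2, add them, and then
% add \epsilon(\sigma_1\delta_1\kappa_1 + \sigma_2\delta_2\kappa_2) to both sides:
\delta_1 + \delta_2 + \epsilon(\sigma_1\delta_1\kappa_1 + \sigma_2\delta_2\kappa_2)
  = (\sigma_1\delta_1 + \sigma_2\delta_2)\,\bigl[\,1 - (1-\epsilon)u\,\bigr] .

% Hence the \delta's cancel from (5.15):
i_{12}\,\bigl[\,1 - (1-\epsilon)u\,\bigr] = \epsilon\kappa_1\kappa_2 .

% (5.13) can be rewritten as 1 - (1-\epsilon)u = 1/\sigma_1 + \epsilon\kappa_1, so
i_{12} = \frac{\epsilon\sigma_1\kappa_1\kappa_2}{1 + \epsilon\sigma_1\kappa_1} .

% Substituting this into (5.13) and replacing \kappa_1 via (5.16) gives a quadratic in \kappa_2:
\epsilon^2\sigma_1\sigma_2^2\,\kappa_2^2
  + \epsilon\sigma_1\sigma_2\,(2 - \epsilon\sigma_2)\,\kappa_2
  + (\sigma_1 - \sigma_2) - \epsilon\sigma_2(\sigma_1 - 1) = 0 ,

% with discriminant
\epsilon^2\sigma_1\sigma_2^3\,\bigl(4 - 4\epsilon + \epsilon^2\sigma_1\sigma_2\bigr) ,
% so the quadratic formula reproduces (5.17).
```

Under this reading, the constant term of the quadratic vanishes exactly at $\epsilon = (\sigma_1 - \sigma_2)/(\sigma_2(\sigma_1 - 1))$, which is the first case of (5.11).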

For $\kappa_2^*$ (and by extension $\kappa_1^*$ and $i_{12}^*$) to be a valid fixed-point, $\kappa_2^*$ must be: (a) real, (b)
$\kappa_2^* \ge 0$, (c) $\kappa_2^* < 1$.

Lemma 5.4. In order for fixed-point solution $\kappa_2^*$, and by extension $\kappa_1^*$ and $i_{12}^*$, to be real valued,
either $\sigma_1\sigma_2 > 1$ or

\[ \epsilon < \frac{2(1 - \sqrt{1 - \sigma_1\sigma_2})}{\sigma_1\sigma_2} \quad \text{or} \quad \epsilon > \frac{2(1 + \sqrt{1 - \sigma_1\sigma_2})}{\sigma_1\sigma_2} . \]

Proof. This constraint comes from the square root in equation (5.17) for $\kappa_2^*$. We analyze
the quadratic expression $4 - 4\epsilon + \epsilon^2\sigma_1\sigma_2$ (in terms of $\epsilon$) from inside the square root. It is a
simple, upward-facing parabola. Solving for the roots of the quadratic equation in terms
of $\epsilon$ we find

\[ \epsilon = \frac{2(1 \pm \sqrt{1 - \sigma_1\sigma_2})}{\sigma_1\sigma_2} . \]

For $\sigma_1\sigma_2 > 1$ there is no real solution because the expression is positive for all values of $\epsilon$.
Thus, if $\sigma_1\sigma_2 > 1$ then $\kappa_2^*$ must be real valued. For $\sigma_1\sigma_2 < 1$ a portion of the parabola is
negative. Therefore, we require that $\epsilon$ be in the positive region of the quadratic expression,
where $\epsilon$ is less than the lower root or greater than the upper root. □

To find when $\kappa_2^* \ge 0$, we consider the cases above for which it is real. As we explained
before, we will focus on the lower bound for $\epsilon$.

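Lemma 5.5. For strengths $\sigma_1\sigma_2 > 1$, fixed-point $\kappa_2$ is monotonically increasing as a function of
$\epsilon$.

Proof. Taking the derivative of (5.17) we get

\[ \frac{\pm(-2 + \epsilon)\sqrt{\sigma_2} + \sqrt{\sigma_1}\sqrt{4 - 4\epsilon + \epsilon^2\sigma_1\sigma_2}}{\epsilon^2\sqrt{\sigma_1}\,\sigma_2\sqrt{4 - 4\epsilon + \epsilon^2\sigma_1\sigma_2}} \]

Because $\sigma_1\sigma_2 > 1$, all of the square roots are real valued. The denominator is clearly positive,
so to prove that $\kappa_2$ is monotonically increasing, we must show that the numerator is
positive. To show that the numerator is always positive we would like to show that

\[ \pm(-2 + \epsilon)\sqrt{\sigma_2} < \sqrt{\sigma_1}\sqrt{4 - 4\epsilon + \epsilon^2\sigma_1\sigma_2} \]

or alternatively

\[ \frac{\sigma_1}{\sigma_2} \cdot \frac{4 - 4\epsilon + \epsilon^2\sigma_1\sigma_2}{4 - 4\epsilon + \epsilon^2} > 1 . \]

Because $\sigma_1 \ge \sigma_2$ the first term is clearly $\ge 1$. For $\sigma_1\sigma_2 > 1$ (and of course $\epsilon > 0$) this is
trivially true. □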


Lemma 5.6. Fixed-point solution $\kappa_2^-$, defined by

\[ \kappa_2^- = \frac{-2\epsilon\sigma_1\sigma_2 + \epsilon^2\sigma_1\sigma_2^2 - \epsilon\sigma_1^{1/2}\sigma_2^{3/2}\sqrt{4 - 4\epsilon + \epsilon^2\sigma_1\sigma_2}}{2\epsilon^2\sigma_1\sigma_2^2} , \tag{5.18} \]

can only be positive when $\kappa_2^+$, defined by

\[ \kappa_2^+ = \frac{-2\epsilon\sigma_1\sigma_2 + \epsilon^2\sigma_1\sigma_2^2 + \epsilon\sigma_1^{1/2}\sigma_2^{3/2}\sqrt{4 - 4\epsilon + \epsilon^2\sigma_1\sigma_2}}{2\epsilon^2\sigma_1\sigma_2^2} , \tag{5.19} \]

is positive.

Proof. As a simple case, for $\sigma_1\sigma_2 > 1$, $\kappa_2^- < 0$ and thus invalid for all $\epsilon > 0$. As $\epsilon$
approaches 0, it is clear that $\kappa_2^- \to -\infty$, and as $\epsilon \to \infty$, we see that $\kappa_2^-$ approaches 0.
Since from the previous lemma we know that it is monotonically increasing, $\kappa_2^- < 0$ for
$\sigma_1\sigma_2 > 1$.

If we do not restrict $\sigma_1$ and $\sigma_2$, it is still clear that $\kappa_2^- < \kappa_2^+$ for all $\epsilon \ge 0$, since the last
term is always positive. We will see later that $\kappa_2^+ < 1$ for all $\epsilon > 0$. Therefore, the range
for which $\kappa_2^-$ is valid is a strict subset of that for which $\kappa_2^+$ is valid. □

Because $\kappa_2^-$ is only valid when $\kappa_2^+$ is valid, it has no impact on the phase transition
claimed in Theorem 5.1. As a result, we will focus on $\kappa_2^+$ for the remainder of the proof
and, with a slight abuse of notation, use $\kappa_2^*$ to denote $\kappa_2^+$.

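Lemma 5.7. When strengths $\sigma_1\sigma_2 \ge 1$, the fixed-point for the population infected by virus 2 is
positive, $\kappa_2^* > 0$, if and only if

\[ \epsilon > \frac{\sigma_1 - \sigma_2}{\sigma_2(\sigma_1 - 1)} . \]

Proof. Solving equation (5.19) for $\kappa_2^* = 0$ produces $\epsilon = \frac{\sigma_1 - \sigma_2}{\sigma_2(\sigma_1 - 1)}$. Because $\kappa_2^*$ is monotonically
increasing in this region ($\sigma_1\sigma_2 > 1$), for all $\epsilon$ greater than this solution, $\kappa_2^* > 0$, and
for all $\epsilon$ less than this solution $\kappa_2^* \le 0$. □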

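Lemma 5.8. If virus strengths $\sigma_1 + \sigma_2 < 2$, then the fixed-point for the population infected by
virus 2 is positive, $\kappa_2^* > 0$, for

\[ \epsilon > \frac{2(1 + \sqrt{1 - \sigma_1\sigma_2})}{\sigma_1\sigma_2} . \]

Proof. For $\kappa_2^*$ to be positive, the numerator of (5.19) must be positive. We can reduce this
as follows:

\begin{align*}
& -2\epsilon\sigma_1\sigma_2 + \epsilon^2\sigma_1\sigma_2^2 + \epsilon\sigma_1^{1/2}\sigma_2^{3/2}\sqrt{4 - 4\epsilon + \epsilon^2\sigma_1\sigma_2} \\
&= \epsilon\sigma_2\left(\sqrt{\sigma_1\sigma_2}\sqrt{4 - 4\epsilon + \epsilon^2\sigma_1\sigma_2} - 2\sigma_1 + \epsilon\sigma_1\sigma_2\right) \\
&\ge \epsilon\sigma_2\left(0 - 2\sigma_1 + 2(1 + \sqrt{1 - \sigma_1\sigma_2})\right) \\
&= 2\epsilon\sigma_2\left(-\sigma_1 + 1 + \sqrt{1 - \sigma_1\sigma_2}\right)
\end{align*}

For this to be positive we must have $\sqrt{1 - \sigma_1\sigma_2} > \sigma_1 - 1$, which is true for $\sigma_1 + \sigma_2 < 2$. □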

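Lemma 5.9. If virus strengths $\sigma_1 + \sigma_2 \ge 2$ then the fixed-point for the population infected by
virus 2 is positive, $\kappa_2^* > 0$, for

\[ \epsilon > \frac{\sigma_1 - \sigma_2}{\sigma_2(\sigma_1 - 1)} . \]

Proof. Again, for $\kappa_2^*$ to be positive, the numerator of (5.19) must be positive. We can
reduce this as follows:

\begin{align*}
& -2\epsilon\sigma_1\sigma_2 + \epsilon^2\sigma_1\sigma_2^2 + \epsilon\sigma_1^{1/2}\sigma_2^{3/2}\sqrt{4 - 4\epsilon + \epsilon^2\sigma_1\sigma_2} \\
&= \epsilon\sigma_2\left(\sqrt{\sigma_1\sigma_2}\sqrt{4 - 4\epsilon + \epsilon^2\sigma_1\sigma_2} - 2\sigma_1 + \epsilon\sigma_1\sigma_2\right) \\
&\ge \epsilon\sigma_2\left(\sqrt{\sigma_1\sigma_2}\sqrt{\frac{\sigma_1(-2 + \sigma_1 + \sigma_2)^2}{\sigma_2(-1 + \sigma_1)^2}} - 2\sigma_1 + \sigma_1\,\frac{\sigma_1 - \sigma_2}{\sigma_1 - 1}\right) \\
&= \epsilon\sigma_1\sigma_2\left(\frac{2 - \sigma_1 - \sigma_2}{1 - \sigma_1} - 2 - \frac{\sigma_1 - \sigma_2}{1 - \sigma_1}\right) \\
&= 0
\end{align*}

□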


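Lemma 5.10. The fixed-point for the population infected by virus 2 is valid, $\kappa_2^* \le 1$, for $\sigma_1 \ge \sigma_2$
and $\epsilon \ge 0$.

Proof. The constraint $\kappa_2^* \le 1$ is equivalent to

\[ -2\epsilon\sigma_1\sigma_2 - \epsilon^2\sigma_1\sigma_2^2 + \epsilon\sigma_1^{1/2}\sigma_2^{3/2}\sqrt{4 - 4\epsilon + \epsilon^2\sigma_1\sigma_2} \le 0 \]

This can be simplified as follows:

\[ \sqrt{\sigma_1\sigma_2}\sqrt{4 - 4\epsilon + \epsilon^2\sigma_1\sigma_2} \le 2\sigma_1 + \epsilon\sigma_1\sigma_2 \tag{5.20} \]

\[ \sigma_1\sigma_2(4 - 4\epsilon + \epsilon^2\sigma_1\sigma_2) \le 4\sigma_1^2 + \epsilon^2\sigma_1^2\sigma_2^2 + 4\epsilon\sigma_1^2\sigma_2 \tag{5.21} \]

\[ \sigma_1\sigma_2 - \epsilon\sigma_1\sigma_2 \le \sigma_1^2 + \epsilon\sigma_1^2\sigma_2 \tag{5.22} \]

\[ \frac{\sigma_2}{\sigma_1} \cdot \frac{1 - \epsilon}{1 + \epsilon\sigma_2} \le 1 \tag{5.23} \]

The simplification to (5.23) makes it clear that the lemma is true for $\sigma_1 \ge \sigma_2 > 0$. □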


As such, for any interaction factor $\epsilon > \epsilon_{\mathrm{critical}}$, we have proved that $\kappa_1^*$ and $\kappa_2^*$ are valid
equilibrium points for which the population infected by each virus $\kappa_1, \kappa_2 > 0$. □
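
As a sanity check on the closed form, the sketch below evaluates (5.15)-(5.17) using the '+' branch of (5.17), then substitutes the result back into the equilibrium constraints (5.7)-(5.9), with (5.7) and (5.8) divided by $\delta_1$ and $\delta_2$ so that only $\sigma_1$ and $\sigma_2$ appear. The function names and the chosen parameter values are assumptions for illustration.

```python
import math


def coexistence_fixed_point(sigma1, sigma2, delta1, delta2, eps):
    """Evaluate (5.17) ('+' branch), (5.16) and (5.15); returns (k1, k2, i12)."""
    root = math.sqrt(4 - 4 * eps + eps ** 2 * sigma1 * sigma2)
    numerator = (-2 * eps * sigma1 * sigma2
                 + eps ** 2 * sigma1 * sigma2 ** 2
                 + eps * math.sqrt(sigma1) * sigma2 ** 1.5 * root)
    k2 = numerator / (2 * eps ** 2 * sigma1 * sigma2 ** 2)        # (5.17)
    k1 = k2 + (sigma1 - sigma2) / (eps * sigma1 * sigma2)         # (5.16)
    s1, s2 = sigma1 * delta1, sigma2 * delta2
    i12 = (eps * k1 * k2 * (s1 + s2)
           / (delta1 + delta2 + eps * s1 * k1 + eps * s2 * k2))   # (5.15)
    return k1, k2, i12


def equilibrium_residuals(k1, k2, i12, sigma1, sigma2, delta1, delta2, eps):
    """Left minus right sides of (5.7)/delta1, (5.8)/delta2 and (5.9)."""
    i1, i2 = k1 - i12, k2 - i12
    r1 = sigma1 * k1 * (1 - k1 - (1 - eps) * i2) - k1
    r2 = sigma2 * k2 * (1 - k2 - (1 - eps) * i1) - k2
    r3 = (eps * (sigma1 * delta1 * k1 * i2 + sigma2 * delta2 * k2 * i1)
          - (delta1 + delta2) * i12)
    return r1, r2, r3


# Strengths from Section 5.5 with eps = 0.4; the cure rates are illustrative.
k1, k2, i12 = coexistence_fixed_point(6, 4, 0.1, 0.1, 0.4)
res = equilibrium_residuals(k1, k2, i12, 6, 4, 0.1, 0.1, 0.4)
```

For these values the computed point has $0 < \kappa_2^* < \kappa_1^* < 1$ and the residuals vanish up to floating-point error, which is consistent with the lemmas above for $\epsilon = 0.4 > \epsilon_{\mathrm{critical}}$.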


□ 


5.5  Experiments 

We  demonstrate  our  result  using  (a)  simulation  experiments  and  (b)  case  studies  using 
real  data  in  this  section. 
[Figure 5.3, panel (b): $\epsilon = 0.4 > \epsilon_{\mathrm{critical}} = 0.1$, $\sigma_1 = 6$, $\sigma_2 = 4$; x-axis: Time]


Figure 5.3: Coexistence is possible: Results from simulations on clique of size 1000 with theoretical
fixed points overlaid. (a) shows the steady-state population values, $\kappa_1$, $\kappa_2$, and $i_{12}$, for each
value of $\epsilon$, with the theoretical $\epsilon_{\mathrm{critical}}$ marked. (b) shows the development of the two viruses
over time for $\epsilon > \epsilon_{\mathrm{critical}}$. Notice that both viruses survive as expected.


5.5.1  Setup 

Without loss of generality, in our experiments we assumed that the first virus is the
stronger virus. We primarily focus on the case where $\sigma_1 > \sigma_2 > 1$. For our simulations
we use $\sigma_1 = 6$ and $\sigma_2 = 4$.

We run a simulation on a fully-connected clique of 1000 nodes. We vary $\epsilon$ around
our expected threshold and for each value of $\epsilon$ perform 10 runs over 4000 time steps. On
each run we begin by infecting 30 nodes at random with each virus.

We analyze the results in two ways. First, we create a steady-state plot of mean values
and standard deviations for $\kappa_1$, $\kappa_2$, and $i_{12}$ at steady-state over a range of values for $\epsilon$.
Over the results of the simulation we draw the behavior predicted by our results. Second,
for one $\epsilon > \epsilon_{\mathrm{critical}}$ we track each virus's development over time with a time-plot. The
time-plot takes the average number of nodes infected ($\kappa_1$, $\kappa_2$, and $i_{12}$) at each time step
and plots this against time. Although the simulations were run for 4000 time steps, the
plots are truncated to give more detail to the initial fluctuations of the virus counts.
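
The setup above can be approximated with a small stochastic simulator. The sketch below is a stand-in, not the simulator used in the thesis: it uses a Gillespie-style continuous-time scheme on the aggregate clique counts (the thesis reports discrete time steps), and the function name `simulate_clique`, its arguments and the default seed are assumptions made here.

```python
import random


def simulate_clique(N, beta1, beta2, delta1, delta2, eps,
                    seed_each=30, t_max=4000.0, rng=None):
    """Gillespie-style simulation of the SI_{1|2}S model on a clique.

    Tracks only the counts (I1, I2, I12); on a clique these determine all
    event rates. Returns the counts at time t_max, or earlier if no event
    can occur any more.
    """
    rng = rng or random.Random(0)
    I1, I2, I12 = seed_each, seed_each, 0
    t = 0.0
    while t < t_max:
        S = N - I1 - I2 - I12
        c1, c2 = I1 + I12, I2 + I12  # carriers of virus 1 / virus 2
        events = [
            (beta1 * S * c1, (1, 0, 0)),          # S -> I1
            (beta2 * S * c2, (0, 1, 0)),          # S -> I2
            (eps * beta2 * I1 * c2, (-1, 0, 1)),  # I1 -> I12
            (eps * beta1 * I2 * c1, (0, -1, 1)),  # I2 -> I12
            (delta1 * I1, (-1, 0, 0)),            # I1 -> S
            (delta2 * I2, (0, -1, 0)),            # I2 -> S
            (delta1 * I12, (0, 1, -1)),           # I12 -> I2 (virus 1 heals)
            (delta2 * I12, (1, 0, -1)),           # I12 -> I1 (virus 2 heals)
        ]
        total = sum(rate for rate, _ in events)
        if total == 0:
            break  # nothing left to happen (e.g. both viruses extinct)
        t += rng.expovariate(total)  # waiting time until the next event
        pick = rng.random() * total
        for rate, (d1, d2, d12) in events:
            pick -= rate
            if pick < 0:
                I1, I2, I12 = I1 + d1, I2 + d2, I12 + d12
                break
    return I1, I2, I12


# Small illustrative run (assumed values): N*beta/delta gives strengths 6 and 4.
final_counts = simulate_clique(100, 0.06, 0.04, 1.0, 1.0, 0.4,
                               seed_each=5, t_max=5.0)
```

Averaging the final counts of several runs with different seeds, for a range of `eps` values, gives an empirical counterpart to the steady-state plot described above. A full-size run (1000 nodes, long horizon) generates many events and is correspondingly slower in pure Python.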


5.5.2  Simulation  Results 

Figure  5.3  displays  our  results.  In  short,  the  plots  agree  exactly  with  our  result,  as 
expected.  Figure  5.3(a)  shows  the  steady-state  plot  for  the  two  viruses,  and  the  theoretical 
predictions  closely  match  the  simulation  results.  Similarly,  the  viruses'  growth  as  shown 
in  the  time-plot  in  Figure  5.3(b)  matches  what  is  expected. 

For the steady-state plot, we expect the steady-state value to be one of the other
fixed points where at least one virus dies out for $\epsilon \le \epsilon_{\mathrm{critical}}$ and then a co-existence for
$\epsilon > \epsilon_{\mathrm{critical}}$. In Figure 5.3(a) we see for $\sigma_1 > \sigma_2 > 1$ and $\epsilon = 0$, winner takes all, as was
proven in [PBRF12]. However, for $\epsilon > \epsilon_{\mathrm{critical}}$ we see a coexistence between the viruses as
expected; this is true even when the viruses are competing ($\epsilon < 1$). For $\epsilon = 0.4 > \epsilon_{\mathrm{critical}}$
we have the time-plot, Figure 5.3(b), showing the growth of both viruses to steady-state
from a small infection in the system.


5.5.3  Case-Studies  using  Real  Data 


[Figure 5.4, panels: (a) Hulu and Blockbuster; (b) Firefox and Chrome; (c) Firefox and Chrome (Prediction); x-axis: Time]


Figure  5.4:  Real  web-search  interest  vs.  time  plots  for  pairs  of  competitors  with  our  model  fitted 
to  it.  (c)  Predicts  steady  state  values  based  on  our  model.  Data  acquired  from  Google  Trends. 


We collected historical data for 'web-search interest' for various competing products from the Google-Insights website², which aims to "provide insights into broad search patterns." This allows us to use the data as a proxy for sales/interest for each product. We used the following pairs of rival products:

1. Hulu³ and Blockbuster⁴: Although not direct competitors, both offer video entertainment services, though under very different models.

2. Firefox⁵ and Google Chrome⁶: Two rival web browsers.

² www.google.com/insights/search/
³ www.hulu.com
⁴ www.blockbuster.com

We consider both pairs of products to be examples of partial mutual immunity: people can use both products, but we expect the use of one to detract from the use of the other. While our model does not describe these situations perfectly, we believe it is a good approximation.

In Figure 5.4 we show plots of web-search interest vs. time for both pairs of products, along with our model fitted to the data⁷. In Figure 5.4(a), we used a virus interaction factor of $\epsilon = 0.7$ (along with virus parameters $\delta_{Hulu} = 0.04$, $\beta_{Hulu} = 0.0007$, $\delta_{Blockbuster} = 0.05$, $\beta_{Blockbuster} = 0.00045$). In Figure 5.4(b), we used a virus interaction factor of $\epsilon = 0.6$ (along with virus parameters $\delta_{Firefox} = 0.01$, $\beta_{Firefox} = 0.000095$, $\delta_{Chrome} = 0.01$, $\beta_{Chrome} = 0.00015$). In Figure 5.4(c) we use the same model as in (b) but let the model continue, to see the projected steady-state behavior. We note that the plots begin when Hulu and Chrome are first introduced, with Blockbuster and Firefox at a previous steady state. In each of these fittings the model fits the data well, which supports the suitability of our $SI_{1|2}S$ model.


5.6  Discussion 

5.6.1  A  general  upper  bound 

Conjecture 5.1 (Epsilon Threshold Upper Bound). Given an arbitrary graph with the $SI_{1|2}S$ model parameters $\sigma_1 \ge \sigma_2 \ge 1$, an equilibrium point for which both virus 1 and virus 2 survive exists if $\epsilon > \epsilon_{critical}$, where

$$\epsilon_{critical} \le \frac{1}{\sigma_2} \qquad (5.24)$$


Justification: Since $\sigma_1 \ge \sigma_2 \ge 1$ and $0 < \epsilon < 1$, we know that both virus 1 and virus 2 would be strong enough to survive independently, but there is some competition between them. Because of the competition, as virus 1 spreads to more nodes, virus 2's attack rate on average decreases and thus its strength decreases. Therefore, if we overestimate the strength of virus 1 or the number of people infected with it, this only makes it more difficult for virus 2 to survive and thus decreases the maximum amount of competition virus 2 can handle, increasing $\epsilon_{critical}$. To simplify the problem, we assume that every person is infected with virus 1 (as if virus 1 were infinitely strong). In this case, a node can only be in state $I_1$ or $I_{1,2}$, and the probability of a node with virus 2 infecting a neighbor is always $\epsilon\beta_2$. This is now equivalent to a one-virus model where the strength of virus 2 is $\sigma_2' = \epsilon\sigma_2$. Therefore, as shown in previous research [CWW+08, GMT05, PCF+11], if $\sigma_2' > 1$ then virus 2 will survive. In this case, $\epsilon_{critical} = 1/\sigma_2$. However, because this is a relaxation of the original problem, we know that this is an upper bound and in fact $\epsilon_{critical} \le 1/\sigma_2$.

⁵ www.mozilla.org/en-US/firefox/new/
⁶ www.google.com/chrome
⁷ Fitted with www.alexbeutel.com/jsplot/kdd2012.html
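The relaxation argument can be written out as a single chain of inequalities. Here $\sigma_2$ is the strength of virus 2 and $\epsilon$ the interaction factor; the primed symbol for the reduced strength and the numerical value in the final comment are illustrative assumptions.

```latex
% Relaxation: every node carries virus 1, so virus 2 always spreads at rate \epsilon\beta_2.
\sigma_2' = \epsilon\,\sigma_2 > 1
\;\Longleftrightarrow\;
\epsilon > \frac{1}{\sigma_2}
\quad\Longrightarrow\quad
\epsilon_{critical} \le \frac{1}{\sigma_2}.
% Hypothetical example: \sigma_2 = 2 gives \epsilon_{critical} \le 0.5,
% so any \epsilon > 0.5 would suffice for virus 2 to survive under the conjecture.
```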

5.6.2  Case-Study:  Qualitative  Analysis 

We  also  consider  the  example  of  educational  ideas,  and  specifically  sex  education,  as  a 
virus.  Sociology  literature  analyzing  the  success  of  sex  education  programs  notes  the 
impact  of  network  effects  and  social  structure  on  sex  education  success  [BB05].  We  match 
education policy to our $SI_{1|2}S$ model and analyze the implications.

Policy 1: Abstinence-Only Education. Abstinence-only education teaches abstinence until marriage as the only way to live a healthy life, with students often taking an abstinence pledge. Under our model, virus 1 is believing in abstinence (through education or pledge) and virus 2 is sexual activity. Therefore, those who are in $I_1$ have taken a pledge of abstinence, those who are in $I_2$ are sexually active, and those who are in $S$ do not believe in abstinence but are not sexually active either. It is impossible both to follow an abstinence pledge and to be sexually active, so nobody can be in state $I_{1,2}$. Equivalently, in this case there is full mutual immunity, or $\epsilon = 0$.

Model 1 Predictions and Results. Based on this fit, because $\epsilon = 0$, our model predicts 'winner takes all': the weaker virus dies out and the stronger virus survives. Sociology research [BB05], studying over 11,000 people over 7 years, notes that of the 2399 people claiming to have taken an abstinence pledge, 1622 (67%) forgot it over time. This suggests that $\sigma_{Abstinence} < \sigma_{Sexual\ Activity}$ and, as a result, that in the long run sexual activity will win over abstinence.

Policy 2: Comprehensive Sex Education. Comprehensive sex education teaches numerous methods to have a safe, healthy sex life, discussing both contraception and abstinence. Matching this to our model, virus 1 is being educated in safe-sex practices and valuing their importance, and virus 2 is sexual activity. Therefore, those who are in $I_1$ have been educated about safe-sex practices and believe they are important but are not sexually active, those in $I_2$ are sexually active but do not practice safe sex, those in $I_{1,2}$ practice safe sex, and those in $S$ are neither educated on safe-sex practices nor sexually active. Here we expect little to no competition between the two viruses and thus have an $\epsilon$ value close to, if not equal to, 1.

Model 2 Predictions and Results. Because $\epsilon$ is close to 1, we expect that $\epsilon > \epsilon_{critical}$. As a result, it is possible for the two viruses to coexist, such that there can be a steady state in which people are sexually active and practice safe sex. This appears to match sociology literature claiming that those who initially use condoms will keep using condoms [BB05].

In summary, our model qualitatively agrees with sociology research and offers a plausible explanation for the results of the study. Additionally, these two cases demonstrate the value of a phase transition. In the first case, the model suggests winner takes all and the ineffectiveness of abstinence-only education. In contrast, for Policy 2 the model predicts coexistence, which agrees with the findings and is better for society.

5.6.3  Subtle  Points 

There are several subtle points that we deferred until now for clarity of exposition. Specifically, here we discuss the following issues:

What does it mean for $\epsilon > 1$?

As before, the rate at which virus 2 is transmitted to a node already infected with virus 1 becomes $\epsilon\beta_2$, and the rate at which virus 1 is transmitted to a node already infected with virus 2 becomes $\epsilon\beta_1$. However, because $\epsilon > 1$, the transmission rate for each virus increases for neighbors that are already infected with the other virus. We consider this to be a form of cooperation between the viruses (products, ideas, etc.).

This  pattern  of  cooperation  between  products  is  common  in  product  ecosystems. 
An  example  of  this  is  that  people  who  have  an  iPod  are  more  likely  to  buy  music  and 
videos  through  Apple's  iTunes.  Making  use  of  such  cooperation  can  be  seen  in  'freebie 
marketing'  or  the  'razor  and  blades  business  model,'  in  which  the  company  producing 
razor  blades  sells  the  razors  at  an  artificially  low  price  creating  a  market  for  the  blades. 
This  method  of  tightly  integrating  products  is  common  in  a  variety  of  industries. 

What happens if $\sigma_2$ or $\sigma_1 < 1$?

Because $\sigma_1 \ge \sigma_2$, there are two cases we can analyze. The first is when $\sigma_1 \ge 1 > \sigma_2$, in which the second virus is too weak to survive on its own. We will refer to this as the "piggyback setting," because virus 2 can only survive with the help of the first. The second is when $1 > \sigma_1 \ge \sigma_2$, where both viruses are independently too weak to survive. We will refer to this as the "teamwork setting," because only through cooperation can both viruses survive. In each of these cases, Theorem 5.1 still holds: plugging in $\sigma_1$ and $\sigma_2$, we find an $\epsilon_{critical}$ above which a fixed point with $\kappa_1, \kappa_2 > 0$ exists. This suggests that even if both viruses are independently too weak to survive on their own, with enough cooperation they can.

To test our theorem, we ran similar simulations where either one or both viruses were too weak to survive independently. In Figure 5.5(a) we show the steady-state plot for the piggyback case of $\sigma_1 > 1 > \sigma_2$. As expected, for $\epsilon = 1$, when the viruses are independent, virus 1 survives but virus 2 is not strong enough and dies out. For a sufficient amount of cooperation, $\epsilon > \epsilon_{critical}$, we find that virus 2 can survive as well. We see in Figure 5.5(c), the corresponding time-plot where $\epsilon = 3.5 > \epsilon_{critical}$, that once virus 1 grows, virus 2 is able to survive as well.

We also simulated the teamwork case, where neither virus is independently strong enough to survive, $1 > \sigma_1 \ge \sigma_2$. In Figure 5.5(b) we show the steady-state plot for this case. Again, the theoretical result and predicted phase transition match the simulation results. For $\epsilon = 1$, the two viruses are independent and, since they cannot survive on their own, die out. In this case, the phase transition is based on the second part of Theorem 5.1, where $\sigma_1 + \sigma_2 < 2$, and as such the bound results from the restriction that $\kappa_2$ be real. Consequently, at $\epsilon_{critical}$ both $\kappa_1, \kappa_2 > 0$ rather than equal to 0. At the threshold we see a large amount of uncertainty in the simulation, but as we move away from the threshold the simulation follows this new fixed point. Interestingly, we must initially infect a large portion of the graph for the system to go to this fixed point and not die out. For $\epsilon = 7.75 > \epsilon_{critical}$, Figure 5.5(d) shows the time-plot, demonstrating that both viruses quickly reach steady state with a high amount of overlap.

[Figure 5.5 panels: (a) $\sigma_1 = 3$, $\sigma_2 = 0.5$; (b) $\sigma_1 = 0.8$, $\sigma_2 = 0.6$; (c) $\epsilon = 3.5$, $\sigma_1 = 3$, $\sigma_2 = 0.5$; (d) $\epsilon = 8$, $\sigma_1 = 0.8$, $\sigma_2 = 0.6$.]

Figure 5.5: With enough collaboration, even weak viruses can survive. Results from simulations on a population of N = 1000 with theoretical values overlaid, and $1 > \sigma_2$. The first column, (a) and (c), is the "piggyback" case where just virus 2 is too weak to survive. The second column, (b) and (d), is the "teamwork" case where neither virus is strong enough to survive on its own. The top row gives the steady-state plots, showing the steady-state footprint vs. the interaction factor $\epsilon$. The second row gives the time-plots, showing the infection footprint developing over time.


5.7  Conclusions 

We defined and studied the problem of partial competition, where two viruses/products provide partial immunity against each other. The main contributions of our work are as follows:

1. Problem Definition: The problem is novel in the data mining and web mining communities, and even in the epidemiology literature [AM91, Het00, Ste09].

2. Threshold Result and Proof: We showed that there is a phase transition: 'winner takes all' holds until the competition level drops below a critical value (equivalently, until the interaction factor $\epsilon$ rises above $\epsilon_{critical}$). Beyond this point we find a closed-form steady-state solution with coexistence.

3. Experiments and Case-studies: We showed results from real settings (such as browsers: Firefox vs. Google Chrome), which agree with our model.



Part  II 
Algorithms

Overview


We  turn  our  attention  to  managing  cascade-like  processes  on  networks.  As  we  will  see, 
these  problems  are  closely  related  to  the  epidemic  threshold  problems  we  addressed  in 
the  previous  part  of  this  thesis.  We  solve  two  major  problems  here: 

•  Effective and Efficient Immunization: Given a large graph, like a computer communication network, which k nodes should we immunize (or monitor, or remove) to make it as robust as possible against a computer virus attack? Which nodes should we immunize to prevent an epidemic if the virus is spreading over an underlying network that changes over time (say, day vs. night connectivity)? How should we distribute a fixed amount of infection-control resource (say, antibiotics) among hospitals? We answer these and several other variants of the immunization problem. Our algorithms, such as NetShield and Smart-Alloc, achieve orders-of-magnitude improvement over current practice and other heuristics for complete immunization (i.e., node removal) and for fractional immunization over real hospital patient-transfer graphs, respectively. Finally, we also present effective algorithms for the case when the underlying network changes with time.

•  Finding Culprits: Next, we tackle a different and difficult question: given a single snapshot of a partly infected network, how can we reliably identify the nodes from which the epidemic started, whether for inoculation to prevent future epidemics or for exploitation in viral marketing? We propose to employ the Minimum Description Length principle to identify the set of seed nodes from which the given snapshot can be described most succinctly. We introduce NetSleuth (based on a novel 'submatrix-laplacian' method), a highly efficient algorithm that both identifies the set of seed nodes that best describes the given situation and automatically selects the best number of seed nodes, in contrast to the state of the art.

These are very important tasks with numerous applications in a variety of areas, such as public health (vaccination campaigns), cyber security (patching workstations), and online social-media analysis (detecting sources of rumors).

Chapter 6

Complete  Node-Removal 


Given a large graph, like a computer communication network, which k nodes should we immunize (or monitor, or remove) to make it as robust as possible against a computer virus attack? Which nodes should we immunize to prevent an epidemic if the virus is spreading over an underlying network that changes over time (say, day vs. night connectivity)? We need (a) a measure of the 'Vulnerability' of a given network (or time-varying network), (b) a measure of the 'Shield-value' of a specific set of k nodes, and (c) a fast algorithm to choose the best such k nodes.

It turns out that this problem is closely related to the epidemic threshold problem we studied in the preceding part of this thesis. Specifically, we answer all three questions above: we give the justification behind our choices and show that they agree with intuition as well as with our previous results on the epidemic threshold (see Chapters 2 and 3). Moreover, we propose NetShield, a fast and scalable algorithm that is provably near-optimal (within a constant factor of optimal) when the network is static. We also give experiments on large real graphs, where NetShield achieves speed savings exceeding 7 orders of magnitude against straightforward competitors. Finally, for the dynamic case, we give efficient heuristics and evaluate the effectiveness of our methods on synthetic and real data such as the MIT reality mining graphs.


6.1  Introduction 

Given a graph, we want to quickly find the k best nodes to immunize (or, equivalently, remove) to make the remaining nodes as robust as possible to a virus attack. This is the core problem for many applications. In a computer network intrusion setting, we want the k best nodes to defend (e.g., through expensive and extensive vigilance) to minimize the spread of malware. Similarly, in a law-enforcement setting, given a network of criminals, we want to neutralize those nodes that will maximally scatter the graph. Likewise, which nodes should we immunize if the underlying network varies with time (motivated by the day-night pattern of human behavior)?

The problem boils down to three questions, which we address in this chapter. For the static case (similar questions can also be asked in the dynamic case of time-varying graphs):

Q1. Vulnerability measure: How can we capture the 'Vulnerability' of the graph in a single number? That is, how likely or easily will a graph be infected by a virus? Similarly, how can we capture the 'Vulnerability' of a series of graphs?

Q2. 'Shield-value': How can we quantify the 'Shield-value' of a given set of nodes in the graph, i.e., how important are they in terms of maintaining the 'Vulnerability' of the graph? Alternatively, how much less vulnerable will the graph be to the virus attack if those nodes are immunized/removed?

Q3. Algorithm: How can we quickly determine the k nodes that collectively exhibit the highest 'Shield-value' score on large, disk-resident graphs?


When the graph does not change, we start by adopting the first eigenvalue $\lambda$ of the graph as the 'Vulnerability' measurement (for Q1), which is motivated by immunology (based on our results in the previous chapters) and by graph loop/path capacity. Based on that, we propose a novel definition of the 'Shield-value' score Sv(S) for a specific set of nodes (for Q2). By carefully using results from the theory of matrix perturbation, we show that the proposed 'Shield-value' gives a good approximation of the corresponding eigen-drop (i.e., the decrease of the 'Vulnerability' measurement if we remove/immunize the set of nodes S from the graph). Furthermore, we show that the proposed 'Shield-value' score is sub-modular, which enables us to develop a near-optimal and scalable algorithm (NetShield) to find a set of nodes with the highest 'Shield-value' score (for Q3). We conduct extensive experiments on several real data sets, illustrating the effectiveness and efficiency of the proposed methods. Specifically, our experiments show that the proposed method NetShield (a) leads to an effective immunization strategy; (b) scales linearly with the size of the graph; and (c) is dramatically faster than competitors (over 7 orders of magnitude).

Similarly,  we  choose  the  first  eigenvalue  of  the  system  matrix  As  (see  Chapter  3)  as 
the  quality  metric  in  the  dynamic  case.  We  further  give  efficient  and  effective  heuristics 
to  optimize  it  and  demonstrate  their  efficacy  through  several  experiments. 

The rest of the chapter is organized as follows. Section 6.2 gives the related work. Sections 6.3 to 6.7 deal with the problem in the static case (when the network does not change with time): Section 6.3 gives the problem definitions, Section 6.4 presents the 'Vulnerability' measurement, Section 6.5 presents the proposed 'Shield-value' score, Section 6.6 addresses the computational issues, and Section 6.7 evaluates the proposed methods. We next tackle the immunization problem for time-varying networks in Section 6.8. Finally, Section 6.9 gives the conclusions.

6.2  Related  Work


In this section, we review the related work, which can be categorized into three parts: measuring the importance of nodes on graphs, immunization, and spectral graph analysis. Related work on general graph mining can be found in earlier chapters.

Measuring Importance of Nodes on Graphs. In the literature, there are many node importance measurements, including betweenness centrality, both the one based on the shortest path [Fre77] and the one based on random walks [New05b], PageRank [PBMW98], HITS [Kle98], and coreness score (defined by k-core decomposition) [MW03]. Other remotely related works include the abnormality score of a given node [SQCF05], articulation points [HT08], and k-vertex cut [HT08]. Our 'Shield-value' score is fundamentally different from these node importance scores, in the sense that they all aim to measure the importance of an individual node, whereas our 'Shield-value' tries to collectively measure the importance of a set of k nodes. Although all these existing measures are successful for the goals they were originally designed for, they are not designed for the purpose of immunization. Therefore, it is not surprising that they lead to sub-optimal immunization results (see Figure 6.4). Moreover, several of these importance measurements do not scale up well to large graphs, being cubic or quadratic with respect to the number of nodes n, even if we use approximations (e.g., [MW08]). In contrast, the proposed NetShield is linear with respect to the number of edges and the number of nodes ($O(nk^2 + m)$). Another remotely related work is outbreak detection [LKG+07], in the sense that both works aim to select a subset of "important" nodes on graphs. However, the motivating applications of this work (e.g., immunization) are different from those of outbreak detection [LKG+07] (e.g., contaminants in a water distribution network). Consequently, we solve different optimization problems (e.g., maximizing the 'Shield-value' in eq. (6.2)) in this chapter.

Immunization. Briesemeister et al. [BLP03] focus on immunization of power law graphs. They compare the random-wiring version (that is, standard preferential attachment) with "highly clustered" power law graphs. Their simulation experiments on such synthetic graphs show that the highly clustered graphs can be more easily defended against viruses, while random-wiring ones are typically overwhelmed, despite identical immunization policies.

Cohen et al. [CHbA03] studied the acquaintance immunization policy (see Section 6.8 for a description of this policy), and showed that it is much better than random, for both the SIS and the SIR model. For power law graphs (with no rewiring), they also derived formulae for the critical immunization fraction, above which the epidemic is arrested. Madar et al. [MKC+04] continued along these lines, mainly focusing on the SIR model for scale-free graphs. They linked the problem to bond percolation, and derived formulae for the effect of several immunization policies, showing that the "acquaintance" immunization policy is the best. Both works were analytical, without studying any real graphs.

Hayashi et al. [HMM03] study the case of a growing network, and they derive analytical formulas for such power law networks (no rewiring). They introduce the SHIR model (Susceptible, Hidden, Infectious, Recovered) to model computers under e-mail virus attack, and derive the conditions for extinction under random and under targeted immunization, always for power law graphs with no rewiring.

In  short,  none  of  the  above  papers  solves  the  problem  of  optimal  immunization  for 
an  arbitrary,  given  graph  or  a  set  of  arbitrary  time-varying  graphs. 

Spectral Graph Analysis. Pioneering work in this area can be traced back to Fiedler's seminal work [Fie73]. Representative follow-up works include [SM97, NJW01, ZHD+01, DLJ08]. All of these works use the eigenvectors of the graph (or of the graph Laplacian) to find communities in the graph.


6.3  Problem  Definitions  (Static  Graphs) 

Table 6.1 lists the main symbols we use throughout the chapter. In this work, we focus on un-directed un-weighted graphs. We represent the graph by its adjacency matrix. Following standard notation, we use capital bold letters for matrices (e.g., A), lower-case bold letters for vectors (e.g., a), and calligraphic fonts for sets (e.g., S). We denote the transpose with a prime (i.e., A' is the transpose of A), and we use parenthesized superscripts to denote the corresponding variable after deleting the nodes indexed by the superscripts. For example, if $\lambda$ is the first eigenvalue of A, then $\lambda^{(i)}$ is the first eigenvalue of A after deleting its i-th row/column. We use $(\lambda_i, u_i)$ to denote the i-th eigen-pair (sorted by the magnitude of the eigenvalue) of A. When the subscript is omitted, we refer to the first eigenvalue and eigenvector respectively (i.e., $\lambda \triangleq \lambda_1$ and $u \triangleq u_1$).

With  the  above  notations,  our  problems  can  be  formally  defined  as  follows: 

Problem 6.1. Measuring 'Vulnerability'

Given: A large un-directed un-weighted connected graph G with adjacency matrix A;

Find: A single number V(G), reflecting the 'Vulnerability' of the whole graph.

Problem 6.2. Measuring 'Shield-value'

Given: A subset S with k nodes in a large un-directed un-weighted connected graph A;

Find: A single number Sv(S), reflecting the 'Shield-value' of these k nodes (that is, the benefit of their removal/immunization to the vulnerability of the graph).

Problem 6.3. Finding k Nodes of Best 'Shield-value'

Given: A large un-directed un-weighted connected graph A with n nodes and an integer k;

Find: A subset S of k nodes with the highest 'Shield-value' score among all $\binom{n}{k}$ possible subsets.

In the next three sections, we present the corresponding solutions respectively.

Table 6.1: Symbols

Symbol                              Definition and Description
A, B, ...                           matrices (bold upper case)
A(i, j)                             the element at the i-th row and j-th column of matrix A
A(i, :)                             the i-th row of matrix A
A(:, j)                             the j-th column of matrix A
A'                                  transpose of matrix A
a, b, ...                           column vectors
S, T, ...                           sets (calligraphic)
n                                   number of nodes in the graph
m                                   number of edges in the graph
$(\lambda_i, u_i)$                  the i-th eigen-pair of A
$\lambda$                           first eigenvalue of A (i.e., $\lambda \triangleq \lambda_1$)
u                                   first eigenvector of A (i.e., $u \triangleq u_1$)
$\lambda^{(i)}$, $\lambda^{(S)}$    first eigenvalue of A after deleting node i (or the set of nodes in S)
$\Delta\lambda(i)$                  eigen-drop: $\Delta\lambda(i) = \lambda - \lambda^{(i)}$
$\Delta\lambda(S)$                  eigen-drop: $\Delta\lambda(S) = \lambda - \lambda^{(S)}$
Sv(i)                               'Shield-value' score of node i
Sv(S)                               'Shield-value' score of nodes in S
V(G)                                'Vulnerability' score of the graph
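
To make the eigen-drop notation concrete, the following sketch computes $\Delta\lambda(S)$ directly and searches all $\binom{n}{k}$ subsets exhaustively. This is a straightforward baseline, assuming a small dense symmetric adjacency matrix; it is not NetShield and does not use the 'Shield-value' score. The function names and the toy star graph are illustrative assumptions.

```python
from itertools import combinations

import numpy as np


def first_eigenvalue(A):
    """Largest eigenvalue of a symmetric adjacency matrix (0.0 for an empty graph)."""
    if A.shape[0] == 0:
        return 0.0
    return float(np.linalg.eigvalsh(A)[-1])  # eigvalsh returns eigenvalues in ascending order


def eigen_drop(A, nodes):
    """Eigen-drop: decrease of the first eigenvalue after deleting the given nodes."""
    removed = set(nodes)
    keep = [i for i in range(A.shape[0]) if i not in removed]
    return first_eigenvalue(A) - first_eigenvalue(A[np.ix_(keep, keep)])


def best_nodes_brute_force(A, k):
    """Exhaustively search all C(n, k) subsets for the largest eigen-drop."""
    return max(combinations(range(A.shape[0]), k), key=lambda S: eigen_drop(A, S))


# Hypothetical toy input: a 5-node star whose center is node 0.
star = np.zeros((5, 5))
for leaf in range(1, 5):
    star[0, leaf] = star[leaf, 0] = 1.0

print(best_nodes_brute_force(star, 1))  # the center is the best single node to remove
```

The exhaustive search evaluates one eigen-decomposition per subset, which is only feasible for very small n and k; this cost is what motivates a faster algorithm.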

6.4  Background:  Our  Solution  for  Problem  1 

Here, we focus on Problem 6.1. We suggest using the first eigenvalue $\lambda$ as the solution. We should point out that adopting $\lambda$ as the 'Vulnerability' measure of a graph is not our main contribution. Nonetheless, it is the basis of our proposed solutions for both Problem 6.2 and Problem 6.3.


6.4.1  'Vulnerability'  Score

In Problem 6.1, the goal is to measure the 'Vulnerability' of the whole graph by a single number. We adopt the first eigenvalue of the adjacency matrix A as such a measurement (eq. (6.1)): the larger $\lambda$ is, the more vulnerable the whole graph is.

$$V(G) = \lambda \qquad (6.1)$$

Figure 6.1 presents an example, where we have four graphs with 5 nodes. Intuitively, the graph becomes more and more vulnerable from left to right. In other words, for a given strength of the virus attack, it is more likely that an epidemic will break out in the graphs on the right than in those on the left. We can see that the corresponding $\lambda$ increases from left to right as well.

Figure 6.1: An example of measuring the 'Vulnerability' of a graph. More edges, carefully placed, make the graph better connected and thus more vulnerable. Notice that the chain (a) and the star (b) have the same number of edges, but our $\lambda$ score correctly considers the star as more vulnerable.

Notice that the concept of 'Vulnerability' is different from the vertex connectivity of the graph [HT08]. For 'Vulnerability', we want to quantify how likely or easily a graph will be infected by a virus (given the strength of the virus attack), whereas for vertex connectivity, we want to quantify how difficult it is for a graph to be disconnected. For example, graphs (a) and (b) in Figure 6.1 have the same vertex connectivity (both are 1), but graph (b) is more vulnerable to the virus attack. Also notice that although 'Vulnerability' is related to both graph density (i.e., average degree) and diameter, neither of them can fully describe the 'Vulnerability' by itself. For example, in Figure 6.1, (a) and (b) share the same density/average degree although (b) is more vulnerable than (a); (b) and (c) share the same diameter although (c) is more vulnerable than (b).

6.4.2  Justifications 

The first eigenvalue $\lambda$ is a good measurement of the graph 'Vulnerability' because of recent
results on epidemic thresholds from immunology [CWW+08, PCF+11]: $\lambda$ is closely related
to the epidemic threshold $\tau$ of a graph under almost any epidemic model (including the
flu-like SIS (susceptible-infective-susceptible) epidemic model), and specifically $\tau \propto 1/\lambda$.
This means that a virus less infective than $\tau$ will quickly get extinguished instead of
lingering forever. Therefore, given the strength of the virus (that is, the infection rate and
the host-recovery rate), an epidemic is more likely for a graph with larger $\lambda$.

We can also show that the first eigenvalue $\lambda$ is closely related to the so-called loop
capacity and the path capacity of the graph, that is, the number of loops and paths of
length $l$ ($l = 2, 3, \ldots$). If a graph has many such loops and paths, then it is well connected,
and thus more vulnerable (i.e., it is easier for a virus to propagate across the graph, so the
graph is less robust to the virus attack).
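As a quick numerical check of the chain-versus-star claim from Figure 6.1, the sketch below computes the first eigenvalue of both 5-node graphs, and counts closed walks of length 4 as a simple stand-in for the loop capacity. It uses NumPy; the helper names (`first_eigenvalue`, `chain`, `star`, `closed_walks`) are illustrative and do not come from the thesis.

```python
import numpy as np

def first_eigenvalue(A):
    """Largest eigenvalue of a symmetric adjacency matrix A."""
    return float(np.linalg.eigvalsh(A)[-1])  # eigvalsh sorts eigenvalues in ascending order

def chain(n):
    """Adjacency matrix of the path 0-1-...-(n-1)."""
    A = np.zeros((n, n))
    for i in range(n - 1):
        A[i, i + 1] = A[i + 1, i] = 1.0
    return A

def star(n):
    """Adjacency matrix of a star: node 0 is linked to nodes 1..n-1."""
    A = np.zeros((n, n))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    return A

def closed_walks(A, l):
    """Number of closed walks of length l, i.e. trace(A^l)."""
    return int(round(np.trace(np.linalg.matrix_power(A, l))))

lam_chain = first_eigenvalue(chain(5))   # sqrt(3), about 1.73
lam_star = first_eigenvalue(star(5))     # 2
print("lambda chain:", lam_chain, "lambda star:", lam_star)
print("closed 4-walks chain:", closed_walks(chain(5), 4),
      "star:", closed_walks(star(5), 4))
```

Both graphs have four edges, yet the star has the larger first eigenvalue and more closed walks, which matches the ordering shown in the figure.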


6.5  Our  Solution  for  Problem  2 

In this section, we focus on Problem 6.2. We first present our solution, and then provide
justifications.

6.5.1 Proposed 'Shield-value' Score


Figure 6.2: An example of measuring the 'Shield-value' score of a given set of nodes. The best k nodes
found by our NetShield are shaded. In (a), notice that the highest-degree nodes (e.g., node 1) are not
chosen. In (b), immunizing the shaded nodes makes the remaining graph most robust to the virus attack.

In Problem 6.2, the goal is to quantify the importance of a given set of nodes S,
and specifically the impact of their deletion/immunization on the 'Vulnerability' of the
rest of the graph. The obvious choice is the drop in eigenvalue, or eigen-drop $\Delta\lambda$, that
their removal will cause to the graph. We propose to approximate it, to obtain efficient
computations, as we describe later. Specifically, we propose using Sv(S), defined as:

$$Sv(S) = \sum_{i \in S} 2\lambda\, u(i)^2 - \sum_{i,j \in S} A(i,j)\, u(i)\, u(j) \tag{6.2}$$

Intuitively, by eq. (6.2), a set of nodes S has a higher 'Shield-value' score if (1) each of
them has a high eigen-score (u(i)), and (2) they are dissimilar to each other (small
or zero A(i, j)). Figure 6.2 shows an example of measuring the 'Shield-value' score of a
given set of nodes. The best k nodes found by our NetShield (introduced in the next
section) are shaded. The result is consistent with intuition. In
figure 6.2(a), it picks node 13 as the best k = 1 node (although nodes 1, 5 and 9 have the
highest degree). In figure 6.2(b), deleting the shaded nodes (nodes 1, 5, 9 and 13) makes
the graph the least vulnerable (i.e., the remaining graph is a set of isolated nodes,
and therefore it is most robust to virus attack).
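To make eq. (6.2) concrete, here is a minimal NumPy sketch that evaluates the 'Shield-value' score of a node set and compares it with the exact eigen-drop on a 5-node star. The function names (`first_eig`, `shield_value`, `eigen_drop`) and the toy graph are assumptions made for illustration; the thesis does not prescribe an implementation.

```python
import numpy as np

def first_eig(A):
    """First eigenvalue and a unit first eigenvector of a symmetric matrix A."""
    vals, vecs = np.linalg.eigh(A)
    lam, u = vals[-1], vecs[:, -1]
    if u.sum() < 0:                      # eigh fixes the eigenvector only up to sign
        u = -u
    return float(lam), u

def shield_value(A, lam, u, S):
    """Sv(S) as in eq. (6.2), for a list of node indices S."""
    S = list(S)
    individual = 2.0 * lam * np.sum(u[S] ** 2)     # sum over i in S of 2*lambda*u(i)^2
    overlap = u[S] @ A[np.ix_(S, S)] @ u[S]        # sum over i,j in S of A(i,j)u(i)u(j)
    return float(individual - overlap)

def eigen_drop(A, lam, S):
    """Exact drop of the first eigenvalue after deleting the nodes in S."""
    removed = set(S)
    keep = [i for i in range(A.shape[0]) if i not in removed]
    return lam - float(np.linalg.eigvalsh(A[np.ix_(keep, keep)])[-1])

# Toy graph: a star with center 0 and leaves 1..4.
A = np.zeros((5, 5))
A[0, 1:] = 1.0
A[1:, 0] = 1.0
lam, u = first_eig(A)
for node in range(5):
    print(node, shield_value(A, lam, u, [node]), eigen_drop(A, lam, [node]))
```

On this toy graph the center receives the highest score, and for the center the score coincides with the true eigen-drop; for a leaf the score only approximates it.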

6.5.2  Justifications 

Here, we provide some justifications for the proposed 'Shield-value' score, which are
summarized in Lemma 6.1. It says that our proposed 'Shield-value' score Sv(S) is a
good approximation for the eigen-drop $\Delta\lambda(S)$ when deleting the set of nodes S from the
original graph A.

Lemma 6.1. Let $\hat{\lambda}^{(S)}$ be the (exact) first eigen-value of $\hat{A}$, where $\hat{A}$ is the perturbed version of A
obtained by removing all of its rows/columns indexed by set S. Let $\delta = \lambda - \lambda_2$ be the eigen-gap, and d be
the maximum degree of A. If $\lambda$ is the simple first eigen-value of A, and $\delta \geq 2\sqrt{2kd}$, then

$$\Delta\lambda(S) = Sv(S) + O\Big(\sum_{j \in S} \|A(:,j)\|^2\Big) \tag{6.3}$$

where Sv(S) is computed by eq. (6.2) and $\Delta\lambda(S) = \lambda - \hat{\lambda}^{(S)}$.

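Proof: First, let us write $\hat{A}$ as a perturbed version of the original matrix A:

$$\hat{A} = A + E, \quad \text{and} \quad E = F + F' + G \tag{6.4}$$

where $F(:,j) = -A(:,j)$ ($j \in S$) and $F(:,j) = 0$ ($j \notin S$); $G(i,j) = A(i,j)$ ($i,j \in S$) and
$G(i,j) = 0$ ($i \notin S$, or $j \notin S$).

Since $Au = \lambda u$, we have

$$u'F'u = u'Fu = (F'u)'u = -\sum_{j \in S} \lambda\, u(j)^2$$

$$u'Gu = \sum_{i,j \in S} A(i,j)\, u(i)\, u(j) \tag{6.5}$$

Let $\tilde{\lambda}$ be the corresponding perturbed eigen-value of $\lambda$. According to matrix perturbation
theory (p.183 [SS90]), we have

$$\begin{aligned}
\tilde{\lambda} &= \lambda + u'Eu + O(\|E\|^2) \\
&= \lambda + u'Fu + u'F'u + u'Gu + O(\|E\|^2) \\
&= \lambda - \Big(\sum_{j \in S} 2\lambda\, u(j)^2 - \sum_{i,j \in S} A(i,j)\, u(i)\, u(j)\Big) + O\Big(\sum_{j \in S} \|A(:,j)\|^2\Big) \\
&= \lambda - Sv(S) + O\Big(\sum_{j \in S} \|A(:,j)\|^2\Big)
\end{aligned} \tag{6.6}$$

Let $\tilde{\lambda}_i$ ($i = 2, \ldots, n$) be the corresponding perturbed eigen-value of $\lambda_i$ ($i = 2, \ldots, n$). Again,
by matrix perturbation theory (p.203 [SS90]), we have

$$\begin{aligned}
\tilde{\lambda} &\geq \lambda - \|E\|_2 \geq \lambda - \|E\|_F \geq \lambda - \sqrt{2kd} \\
\tilde{\lambda}_i &\leq \lambda_i + \|E\|_2 \leq \lambda_i + \|E\|_F \leq \lambda_i + \sqrt{2kd}
\end{aligned} \tag{6.7}$$

Since $\delta = \lambda - \lambda_2 \geq 2\sqrt{2kd}$, we have $\tilde{\lambda} \geq \tilde{\lambda}_i$ ($i = 2, \ldots, n$). In other words, we have $\hat{\lambda}^{(S)} = \tilde{\lambda}$.
Therefore,

$$\Delta\lambda(S) = \lambda - \hat{\lambda}^{(S)} = \lambda - \tilde{\lambda} = Sv(S) + O\Big(\sum_{j \in S} \|A(:,j)\|^2\Big) \tag{6.8}$$

which completes the proof. □


6.6 Our Solution for Problem 3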


In this section, we deal with Problem 6.3. Here, the goal is to find a subset of k nodes
with the highest 'Shield-value' score (among all $\binom{n}{k}$ possible subsets). We start by showing
that the two straight-forward methods (referred to as 'Com-Eigs' and 'Com-Eval') are
computationally intractable. Then, we present the proposed NetShield algorithm.
Finally, we analyze its accuracy as well as its computational complexity.


6.6.1  Preliminaries 

There are two straight-forward methods for Problem 6.3. The first one (referred
to as 'Com-Eigs'^1) works as follows: for each possible subset S, we delete the corresponding
rows/columns from the adjacency matrix A; compute the first eigenvalue of the
new perturbed adjacency matrix; and finally output the subset of nodes which has the
smallest eigenvalue (and therefore the largest eigen-drop). Despite the simplicity of this
strategy, it is computationally intractable due to its combinatorial nature. It is easy to show
that the computational complexity of 'Com-Eigs' is $O(\binom{n}{k} \cdot m)$.^2 This is computationally
intractable even for small graphs. For example, in a graph with 1K nodes and 10K edges,
suppose that it takes about 0.01 second to find its first eigen-value. Then we need about
2,615 years to find the best-5 nodes with the highest 'Shield-value' score!
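The running-time estimate for 'Com-Eigs' can be reproduced with a few lines of arithmetic. The sketch below only restates the numbers given in the text (1K nodes, k = 5, 0.01 second per eigenvalue computation); the variable names and the 365-day year are illustrative choices.

```python
import math

n, k = 1000, 5
seconds_per_eig = 0.01                 # cost of one first-eigenvalue computation, as assumed in the text
num_subsets = math.comb(n, k)          # number of candidate subsets of size k
total_seconds = num_subsets * seconds_per_eig
years = total_seconds / (365 * 24 * 3600)
print(num_subsets, "subsets, roughly", round(years), "years")
```

The subset count is about 8.25 x 10^12, so even a very fast eigen-solver leaves the exhaustive search at thousands of years, in line with the figure quoted above.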

A more reasonable (in terms of speed) way to find the best-k nodes is to evaluate
Sv(S), rather than to compute the first eigen-value $\hat{\lambda}^{(S)}$, $\binom{n}{k}$ times, and pick the subset
with the highest Sv(S). We refer to this strategy as 'Com-Eval'. Compared with
'Com-Eigs' (which is $O(\binom{n}{k} \cdot m)$), 'Com-Eval' is
much faster ($O(\binom{n}{k} \cdot k^2)$). However, 'Com-Eval' is still not applicable to real applications
due to its combinatorial nature. Again, in a graph with 1K nodes and 10K edges, suppose
that it only takes about 0.00001 second to evaluate Sv(S) once. Then we still need about
2.6 years (roughly $8.25 \times 10^{12}$ subsets at 0.00001 second each) to find the best-5 nodes with the highest 'Shield-value' score!


6.6.2  Proposed  NetShield  Algorithm 

The proposed NetShield is given in Alg. 1. In Alg. 1, we compute the first eigenvalue
$\lambda$ and the corresponding eigenvector u in step 1. In step 4, the $n \times 1$ vector v measures
the 'Shield-value' score of each individual node. Then, in each iteration of steps 6-17, we
greedily select one more node and add it into set S according to score(j) (step 13). Note
that steps 10-12 are to exclude those nodes that are already in the selected set S.

^1 In fact, we can prove Problem 6.3 is NP-hard, using a slight variant of the proof of Theorem [fill] in
Chapter 7.

^2 We assume that k is relatively small compared with n and m (e.g., tens or hundreds). Therefore, after
deleting k rows/columns from A, we still have O(m) edges.

Algorithm 1 NetShield

Input: the adjacency matrix A and an integer k
Output: a set S with k nodes

 1: compute the first eigen-value $\lambda$ of A; let u be the corresponding eigen-vector u(j) (j = 1, ..., n);
 2: initialize S to be empty;
 3: for j = 1 to n do
 4:     $v(j) = (2 \cdot \lambda - A(j,j)) \cdot u(j)^2$;
 5: end for
 6: for iter = 1 to k do
 7:     let B = A(:, S);
 8:     let b = B · u(S);
 9:     for j = 1 to n do
10:         if $j \in S$ then
11:             let score(j) = -1;
12:         else
13:             let $score(j) = v(j) - 2 \cdot b(j) \cdot u(j)$;
14:         end if
15:     end for
16:     let $i = \arg\max_j score(j)$, add i to set S;
17: end for
18: return S.
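The steps of Alg. 1 translate almost line by line into NumPy. The following is a minimal sketch under stated assumptions, not the authors' implementation: it uses a dense symmetric matrix and `numpy.linalg.eigh`, whereas a large sparse graph would call for a Lanczos-type solver such as `scipy.sparse.linalg.eigsh`. The function name `netshield` and the toy graph are illustrative.

```python
import numpy as np

def netshield(A, k):
    """Greedy selection of k nodes following the steps of Alg. 1.

    A is a dense, symmetric, non-negative adjacency matrix.
    Returns the selected node indices in the order they were picked.
    """
    n = A.shape[0]
    vals, vecs = np.linalg.eigh(A)            # step 1: first eigen-pair
    lam, u = vals[-1], vecs[:, -1]
    if u.sum() < 0:                           # eigh fixes u only up to sign
        u = -u
    v = (2.0 * lam - np.diag(A)) * u ** 2     # steps 3-5: individual scores
    S = []                                    # step 2
    for _ in range(k):                        # steps 6-17
        if S:
            b = A[:, S] @ u[S]                # steps 7-8
        else:
            b = np.zeros(n)                   # nothing selected yet
        score = v - 2.0 * b * u               # step 13
        for j in S:
            score[j] = -1.0                   # steps 10-11: skip selected nodes
        S.append(int(np.argmax(score)))       # step 16
    return S

# Toy graph: a star with center 0 and leaves 1..4.
A = np.zeros((5, 5))
A[0, 1:] = 1.0
A[1:, 0] = 1.0
print(netshield(A, 2))
```

Only one eigen-decomposition is needed; each of the k iterations afterwards is a matrix-vector product over the columns already selected.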


6.6.3  Analysis  of  NetShield 

Here,  we  analyze  the  accuracy  and  efficiency  of  the  proposed  NetShield. 

First, according to the following theorem, Alg. 1 is near-optimal wrt 'Com-Eval'. In
addition, by Lemma 6.1, our 'Shield-value' score (which 'Com-Eval' tries to optimize) is a
good approximation for the actual eigen-drop $\Delta\lambda(S)$ (which 'Com-Eigs' tries to optimize).
Therefore, we would expect that Alg. 1 also gives a good approximation wrt 'Com-Eigs'
(see Section 6.7 for experimental validation).


Theorem 6.1. Effectiveness of NetShield. Let $S$ and $\tilde{S}$ be the sets selected by Alg. 1 and
by 'Com-Eval', respectively. Let $\Delta\lambda(S)$ and $\Delta\lambda(\tilde{S})$ be the corresponding eigen-drops. Then,
$\Delta\lambda(S) \geq (1 - 1/e)\, \Delta\lambda(\tilde{S})$.


Proof: Let $I$, $J$, $K$ be three sets and $I \subseteq J$. Define the following three sets based on $I$, $J$, $K$:
$S = I \cup K$, $T = J \cup K$, $R = J \setminus I$.

Substituting eq. (6.2), we have

$$\begin{aligned}
Sv(S) - Sv(I) &= \sum_{i \in K} 2\lambda\, u(i)^2 - \sum_{i,j \in K} A(i,j)\, u(i)\, u(j) - 2 \sum_{j \in I,\, i \in K} A(i,j)\, u(i)\, u(j) \\
Sv(T) - Sv(J) &= \sum_{i \in K} 2\lambda\, u(i)^2 - \sum_{i,j \in K} A(i,j)\, u(i)\, u(j) - 2 \sum_{j \in J,\, i \in K} A(i,j)\, u(i)\, u(j)
\end{aligned} \tag{6.9}$$

According to the Perron-Frobenius theorem, we have $u(i) \geq 0$ ($i = 1, \ldots, n$). Therefore,

$$\big(Sv(S) - Sv(I)\big) - \big(Sv(T) - Sv(J)\big) = 2 \sum_{i \in K,\, j \in R} A(i,j)\, u(i)\, u(j) \geq 0 \tag{6.10}$$

$$\Rightarrow \; Sv(S) - Sv(I) \geq Sv(T) - Sv(J)$$

Therefore, the function Sv(S) is sub-modular.

Next, we can verify that node i selected in step 16 of Alg. 1 satisfies $i = \arg\max_{j \notin S} Sv(S \cup \{j\})$
for a fixed set S.

Next, we prove that Sv(S) is monotonically non-decreasing wrt S. According to
eq. (6.9), we have

$$\begin{aligned}
Sv(S) - Sv(I) &= \sum_{i \in K} 2\lambda\, u(i)^2 - \sum_{i,j \in K} A(i,j)\, u(i)\, u(j) - 2 \sum_{j \in I,\, i \in K} A(i,j)\, u(i)\, u(j) \\
&\geq \sum_{i \in K} 2\lambda\, u(i)^2 - 2 \sum_{j \in S,\, i \in K} A(i,j)\, u(i)\, u(j) \\
&= 2 \sum_{i \in K} u(i) \Big(\lambda\, u(i) - \sum_{j \in S} A(i,j)\, u(j)\Big) \\
&\geq 2 \sum_{i \in K} u(i) \Big(\lambda\, u(i) - \sum_{j=1}^{n} A(i,j)\, u(j)\Big) \\
&= 2 \sum_{i \in K} u(i) \big(\lambda\, u(i) - \lambda\, u(i)\big) = 0
\end{aligned} \tag{6.11}$$

where the second last equality is due to the definition of eigenvalue.

Finally, it is easy to verify that $Sv(\phi) = 0$, where $\phi$ is an empty set. Using the property
of sub-modular functions [KG07], we have $\Delta\lambda(S) \geq (1 - 1/e)\, \Delta\lambda(\tilde{S})$. □

According to Lemma 6.2, the computational complexity of Alg. 1 is $O(nk^2 + m)$,
which is much faster than both 'Com-Eigs' ($O(\binom{n}{k} m)$) and 'Com-Eval' ($O(\binom{n}{k} k^2)$).

Lemma 6.2. Computational Complexity of NetShield. The computational complexity of
Alg. 1 is $O(nk^2 + m)$.

Proof:  Omitted  for  brevity.  □ 

Finally,  according  to  Lemma  6.3,  the  space  cost  of  Alg.  1  is  also  efficient  (i.e.,  linear 
wrt  the  size  of  the  graph). 

Lemma 6.3. Space Cost of NetShield. The space cost of Alg. 1 is $O(n + m + k)$.

Proof:  Omitted  for  brevity.  □ 
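The near-optimality claim of Theorem 6.1 can be checked numerically on a graph small enough for exhaustive search. The sketch below compares the 'Shield-value' of a greedily built set with the best value over all subsets of the same size (the 'Com-Eval' answer). The toy graph (a 10-node ring with three chords) and all function names are hypothetical choices made for this illustration.

```python
import itertools
import numpy as np

def first_eig(A):
    """First eigenvalue and a non-negative unit eigenvector of a connected graph."""
    vals, vecs = np.linalg.eigh(A)
    lam, u = vals[-1], vecs[:, -1]
    return float(lam), (u if u.sum() >= 0 else -u)

def shield_value(A, lam, u, S):
    """Sv(S) as in eq. (6.2)."""
    S = list(S)
    return float(2.0 * lam * np.sum(u[S] ** 2) - u[S] @ A[np.ix_(S, S)] @ u[S])

def greedy(A, lam, u, k):
    """Add, k times, the node with the largest marginal gain in Sv."""
    S = []
    for _ in range(k):
        rest = [j for j in range(A.shape[0]) if j not in S]
        S.append(max(rest, key=lambda j: shield_value(A, lam, u, S + [j])))
    return S

# Hypothetical toy graph: a ring on 10 nodes plus three chords.
n, k = 10, 3
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0
for i, j in [(0, 5), (2, 7), (0, 3)]:
    A[i, j] = A[j, i] = 1.0

lam, u = first_eig(A)
S_greedy = greedy(A, lam, u, k)
best = max(shield_value(A, lam, u, S)
           for S in itertools.combinations(range(n), k))
ratio = shield_value(A, lam, u, S_greedy) / best
print(S_greedy, ratio)
```

The printed ratio should never fall below 1 - 1/e (about 0.63), since Sv is monotone and sub-modular; in practice it is often much closer to 1.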


6.7  Experimental  Evaluations  (Static  Graphs) 

We  present  detailed  experimental  results  in  this  section.  All  the  experiments  are  designed 
to  answer  the  following  questions: 

1: (Effectiveness) How effective is the proposed Sv(S) in real graphs?

2: (Efficiency) How fast and scalable is the proposed NetShield?

6.7.1  Data  sets 


Table 6.2: Summary of the data sets

Name       n            m
Karate     34           152
AA         418,236      2,753,798
NetFlix    2,667,199    171,460,874

We used three real data sets, which are summarized in table 6.2. The first data set
(Karate) is a unipartite graph, which describes the friendship among the 34 members of a
karate club at a US university [Zac77]. Each node is a member of the karate club and the
existence of an edge indicates that the two corresponding members are friends. Overall,
we have n = 34 nodes and m = 156 edges.

The second data set (AA) is an author-author network from DBLP.^3 AA is a
co-authorship network, where each node is an author and the existence of an edge indicates
the co-authorship between the two corresponding persons. Overall, we have n =
418,236 nodes and m = 2,753,798 edges. We also construct much smaller co-authorship
networks, using the authors from only one conference (e.g., KDD, SIGIR, SIGMOD, etc.).
For example, KDD is the co-authorship network for the authors in the 'KDD' conference.
These smaller co-authorship networks typically have a few thousand nodes and
up to a few tens of thousands of edges.

^3 http://www.informatik.uni-trier.de/~ley/db/

The last data set (NetFlix) is from the Netflix prize.^4 This is a bipartite graph with
two types of nodes: user and movie. The existence of an edge indicates that the
corresponding user has rated the corresponding movie. Overall, we have n = 2,667,199
nodes and m = 171,460,874 edges. We convert this bipartite graph to a
unipartite graph A:

$$A = \begin{pmatrix} 0 & B \\ B' & 0 \end{pmatrix}$$

where 0 is a matrix with all zero entries and B is the
adjacency matrix of the bipartite graph.
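The bipartite-to-unipartite conversion is a plain block stacking. A minimal sketch with a toy user-by-movie matrix is shown below; the function name `bipartite_to_unipartite` and the toy ratings are assumptions for illustration, and a graph of the NetFlix size would need a sparse representation (for example via `scipy.sparse.bmat`) rather than dense arrays.

```python
import numpy as np

def bipartite_to_unipartite(B):
    """Build A = [[0, B], [B', 0]] from a (users x movies) bi-adjacency matrix B."""
    n_users, n_movies = B.shape
    return np.block([
        [np.zeros((n_users, n_users)), B],
        [B.T, np.zeros((n_movies, n_movies))],
    ])

# Toy example: 2 users, 3 movies; B[i, j] = 1 if user i rated movie j.
B = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
A = bipartite_to_unipartite(B)
print(A.shape)
print(A)
```

The resulting matrix is symmetric, with users occupying the first block of indices and movies the second, so every rating appears as one undirected edge.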


6.7.2  Effectiveness 

Here, we first test the approximation accuracy of the proposed Sv(S). Then, we compare
the different immunization policies, followed by some case studies. Notice that the
quality vs. speed trade-off for the proposed NetShield, the optimal 'Com-Eigs' and the
alternative greedy method is presented in subsection 6.7.3.


Approximation quality of Sv(S)

The proposed NetShield is based on eq. (6.2). That is, we want to approximate the
first eigen-value of the perturbed matrix by $\lambda$ and u. Lemma 6.1 says that Sv(S)
is a good approximation for the actual eigen-drop $\Delta\lambda(S)$. Here, let us experimentally
evaluate how good this approximation is on real graphs. We construct a co-authorship
network from one of the following conferences: 'KDD', 'ICDM', 'SDM' and 'SIGMOD'.
We then compute the linear correlation coefficient between $\Delta\lambda(S)$ and Sv(S) with several
different k values (k = 1, 2, 5, 10, 20). The results are shown in table 6.3. It can be seen
that the approximation is very good: in all the cases, the linear correlation coefficient is
greater than 0.95. Figure 6.3 gives the scatter plot of $\Delta\lambda(S)$ (i.e., the actual eigen-drop) vs.
Sv(S) (i.e., the proposed 'Shield-value') for k = 5 on the 'ICDM' data set.


^4 http://www.netflixprize.com/

Table 6.3: Evaluation of the approximation accuracy of Sv(S). Larger is better.

k     'KDD'     'ICDM'    'SDM'     'SIGMOD'
1     0.9519    0.9908    0.9995    1.0000
2     0.9629    0.9910    0.9984    0.9927
5     0.9721    0.9888    0.9992    0.9895
10    0.9726    0.9863    0.9987    0.9852
20    0.9683    0.9798    0.9929    0.9772

Figure 6.3: Evaluation of the approximation accuracy of Sv(S) on the 'ICDM' graph. The proposed
'Shield-value' Sv (y-axis) gives a good approximation for the actual eigen-drop $\Delta\lambda(S)$ (x-axis). Most
points are on or close to the diagonal (ideal).


Immunization by NetShield

The proposed 'Vulnerability' score of the graph is motivated by the epidemic threshold
[PCF+11]. As a consequence, the proposed NetShield leads to a natural immunization
strategy for the SIS model (susceptible-infective-susceptible, like, e.g., the flu):
quarantine or delete the subset of the nodes detected by NetShield in order to prevent
an epidemic from breaking out.^5

We compare it with the following alternative choices: (1) picking a random neighbor
of a randomly chosen node [CHbA03] ('Acquaintance'), (2) picking the nodes with
the highest eigen scores u(i) (i = 1, ..., n) ('Eigs')^6, (3) picking the nodes with the
highest abnormality scores [SQCF05] ('abnormality'), (4) picking the nodes with the
highest betweenness centrality scores based on the shortest path [Fre77] ('Short'), (5)
picking the nodes with the highest betweenness centrality scores based on random
walks [New05b] ('N.RW'), (6) picking the nodes with the highest degrees ('Degree'), and
(7) picking the nodes with the highest PageRank scores [PBMW98] ('PageRank'). For
each method, we delete 5 nodes for immunization. Let $s = \lambda \cdot b/d$ be the normalized
virus strength (bigger s means a stronger virus), where b and d are the infection rate
and death rate, respectively. The result is presented in figure 6.4, which is averaged over
1000 runs. It can be seen that the proposed NetShield is always the best: its curve
is always the lowest, which means that we always have the least number of infected
nodes in the graph with this immunization strategy. Notice that the performance of
'Eigs' is much worse than that of the proposed NetShield. This indicates that by collectively
finding a set of nodes with the highest 'Shield-value', we indeed obtain extra performance
gain (compared with naively choosing the top-k nodes which have the highest individual
'Shield-value' scores).

^5 In fact, our NetShield will also help with the immunization for almost any model (as the threshold is
dependent on $\lambda$).

^6 For the un-directed graph which we focus on in this work, 'Eigs' is equivalent to 'HITS' [Kle98].

Figure 6.4: Evaluation of immunization of NetShield on the Karate graph, for (a) s = 1.4 and (b) s = 2.9.
The fraction of infected nodes (in log-scale) vs. the time step; s is the normalized virus strength. Lower
is better. The proposed NetShield is always the best, leading to the fastest healing of the graph. Best
viewed in color.
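An immunization experiment of this kind can be sketched with a simple discrete-time SIS simulation. The code below is an assumption-laden illustration, not the simulator used for Figure 6.4: it assumes synchronous updates, independent infection attempts from each infected neighbor with probability b, recovery with probability d, and all non-immunized nodes infected at the start. The name `simulate_sis` and the toy graph are hypothetical.

```python
import numpy as np

def simulate_sis(A, b, d, immunized, steps, rng):
    """Return the fraction of infected nodes at each time step.

    b: per-contact infection probability; d: recovery ('death') probability.
    Immunized nodes are cut out of the graph and can never be infected.
    """
    n = A.shape[0]
    removed = set(immunized)
    alive = np.array([i not in removed for i in range(n)])
    A = A * np.outer(alive, alive)              # drop all edges of immunized nodes
    infected = alive.copy()                     # assumption: everyone else starts infected
    history = []
    for _ in range(steps):
        history.append(infected.sum() / n)
        pressure = A @ infected                 # number of infected neighbors per node
        p_infect = 1.0 - (1.0 - b) ** pressure  # chance that at least one attempt succeeds
        newly = alive & ~infected & (rng.random(n) < p_infect)
        cured = infected & (rng.random(n) < d)
        infected = (infected & ~cured) | newly
    return history

# Toy graph: a star with center 0; compare immunizing the center with immunizing a leaf.
A = np.zeros((6, 6))
A[0, 1:] = 1.0
A[1:, 0] = 1.0
rng = np.random.default_rng(0)
print(simulate_sis(A, 0.3, 0.4, [0], 20, rng)[-1])
print(simulate_sis(A, 0.3, 0.4, [3], 20, rng)[-1])
```

Averaging such runs over many random seeds, for each immunization policy, would give curves comparable in spirit to those in the figure.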

Case  studies 

Next,  we  will  show  some  case  studies  to  illustrate  the  effectiveness  of  the  proposed  Sv(S) 
as  a  'Shield-value'  score  of  a  subset  of  nodes. 

We run the proposed NetShield on the AA data set and return the best k = 200 authors.
Some representative authors, to name a few, are 'Sudhakar M. Reddy', 'Wei Wang', 'Heinrich
Niemann', 'Srimat T. Chakradhar', 'Philip S. Yu', 'Lei Zhang', 'Wei Li', 'Jiawei Han', 'Srinivasan
Parthasarathy', 'Srivaths Ravi', 'Antonis M. Paschalis', 'Mohammed Javeed Zaki', 'Lei Li',
'Dimitris Gizopoulos', 'Alberto L. Sangiovanni-Vincentelli', 'Narayanan Vijaykrishnan', 'Jason
Cong', 'Thomas S. Huang', etc. We can make some very interesting observations from the
result:

1. There are some multi-disciplinary people in the result. For example, Prof. Alberto
L. Sangiovanni-Vincentelli from UC Berkeley is interested in 'design technology',
'cad', 'embedded systems', and 'formal verification'; Prof. Philip S. Yu from UIC is
interested in 'databases', 'performance', 'distributed systems' and 'data mining'.

2. Some people show up because they are famous in one specific area, and occasionally
have one/two papers in a remotely related area (therefore, increasing the path
capacity between two remote areas). For example, Dr. Srimat T. Chakradhar mainly
focuses on 'cad'. But he has co-authored a 'NIPS' paper. Therefore, he creates a
critical connection between these two (originally) remote areas: 'cad' and 'machine
learning'.

3. Some people show up because they have ambiguous names (e.g., Wei Wang, Lei Li,
Lei Zhang, Wei Li, etc.). Take 'Wei Wang' as an example; according to DBLP,^7 there
are 7 different 'Wei Wang's. In our experiment, we treat all of them as one person.
That is to say, it is equivalent to putting an artificial 'Wei Wang' in the graph who
brings 7 different 'Wei Wang's together. These 7 'Wei Wang's are in fact spread out
in quite different areas (e.g., Wei Wang@UNC is in 'data mining' and 'bio'; Wei
Wang@NUS is in 'communication'; Wei Wang@MIT is in 'non-linear systems').

6.7.3  Efficiency 

We  will  study  the  wall-clock  running  time  of  the  proposed  NetShield  here.  Basically, 
we  want  to  answer  the  following  three  questions: 

1.  (Speed)  What  is  the  speedup  of  the  proposed  NetShield  over  the  straight-forward 
methods  ('Com-Eigs'  and  'Com-Eval')? 

2.  (Scalability)  How  does  NetShield  scale  with  the  size  of  the  graph  (n  and  m)  and 
k? 

3.  (Quality/Speed Trade-Off) How does NetShield balance between the quality and
the speed?

For the results we report in this subsection, all of the experiments are done on the same
machine with four 2.4GHz AMD CPUs and 48GB memory, running Linux (2.6 kernel). If
the program takes more than 1,000,000 seconds, we stop running it.

First, we compare NetShield with 'Com-Eigs' and 'Com-Eval'. Figure 6.5 shows
the comparison on three real data sets. We can draw the following conclusions: (1)
Straight-forward methods ('Com-Eigs' and 'Com-Eval') are computationally intractable
even for a small graph. For example, on the Karate data set with only 34 nodes, it takes
more than 100,000 and 1,000 seconds to find the best-10 by 'Com-Eigs' and by 'Com-Eval',
respectively. (2) The speedup of the proposed NetShield over both 'Com-Eigs'
and 'Com-Eval' is huge: in most cases, we achieve several (up to 7) orders of magnitude
speedups! (3) The speedup of the proposed NetShield over both 'Com-Eigs' and
'Com-Eval' quickly increases wrt the size of the graph as well as k. (4) For a given size
of the graph (fixed n and m), the wall-clock time of NetShield is almost constant as k varies,
suggesting that NetShield spends most of its running time in computing $\lambda$ and u.

Figure 6.5: Wall-clock time vs. the budget k for different methods. The time is in the logarithmic scale.
Our NetShield (red star) is much faster. Lower is better.

^7 http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/w/Wang:Wei.html

Next, we evaluate the scalability of NetShield. From figure 6.6, it can be seen that
NetShield scales linearly wrt both n and m, which means that it is suitable for large
graphs.

Figure 6.6: Evaluation of the scalability of the proposed NetShield wrt (a) n (number of nodes, with
m fixed at 119,460) and (b) m (number of edges, with n fixed at 2,667,119). The wall-clock time of our
NetShield scales linearly wrt n and m.

Finally, we evaluate how the proposed NetShield balances between the quality
and speed. For the Karate graph, we use the proposed NetShield to find a set of k
nodes and check the corresponding eigen-drop (i.e., the decrease of the first eigen-value
of the adjacency matrix) as well as the corresponding wall-clock time. We compare it
with 'Com-Eigs', which always gives the optimal solutions (i.e., it returns the subset that
leads to the largest eigen-drop). The results (eigen-drop vs. wall-clock time) are plotted
in figure 6.7. It can be seen that NetShield gains a significant speedup over
'Com-Eigs', at the cost of a small fraction of quality loss (i.e., the green dash lines are
near-flat).

Figure 6.7: Evaluation of the quality/speed trade-off. Eigen-drop vs. wall-clock time (seconds), with
different budget k. The proposed NetShield (red star) achieves a good balance between eigen-drop and
speed. Note that the x-axis (wall-clock time) is in logarithmic scale. The number inside the parenthesis
above each green dash curve is the ratio of eigen-drop between NetShield and 'Com-Eigs'. NetShield is
optimal when this ratio is 1. Best viewed in color.

We also compare the proposed NetShield with the following heuristic (referred to
as 'Greedy'): at each iteration, we re-compute the first eigenvector of the current graph
and pick the node with the highest eigen-score u(i); then we delete this node from the
graph and go to the next iteration. For the NetFlix graph, we find a set of k nodes and
check the corresponding eigen-drop as well as the corresponding wall-clock time. The
quality/speed trade-off curve is plotted in figure 6.8. From the figure, we can make two
observations: (1) the quality of the two methods ('Greedy' vs. the proposed NetShield)
is almost the same (note that the green dash curves in the plots are always flat);
(2) the proposed NetShield is always faster than 'Greedy' (up to 103x speedup).
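The 'Greedy' baseline described above can be sketched as follows. This is a minimal dense-matrix illustration under assumptions (the function name `greedy_eigs` is hypothetical, node deletion is emulated by zeroing a row and column, and absolute values are taken because the solver fixes the eigenvector only up to sign).

```python
import numpy as np

def greedy_eigs(A, k):
    """'Greedy' baseline: re-compute the first eigenvector after every deletion."""
    A = A.copy()
    chosen = []
    for _ in range(k):
        vals, vecs = np.linalg.eigh(A)      # one full eigen-computation per iteration
        score = np.abs(vecs[:, -1])         # eigen-scores of the current graph
        for i in chosen:
            score[i] = -1.0                 # never re-pick a deleted node
        i = int(np.argmax(score))
        chosen.append(i)
        A[i, :] = 0.0                       # delete node i by zeroing its row ...
        A[:, i] = 0.0                       # ... and its column
    return chosen

# Toy graph: a path 0-1-2-3-4; the middle node has the highest eigen-score.
P = np.zeros((5, 5))
for i in range(4):
    P[i, i + 1] = P[i + 1, i] = 1.0
print(greedy_eigs(P, 2))
```

The sketch makes the cost difference visible: this heuristic needs k eigen-computations, whereas NetShield needs a single one followed by cheap score updates.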


Figure 6.8: Comparison of NetShield vs. 'Greedy'. The proposed NetShield (red star) is better than
'Greedy' (i.e., faster, with the same quality). Note that the x-axis (wall-clock time) is in logarithmic scale.
The number inside the parenthesis above each green dash curve is the speedup of the proposed NetShield
over 'Greedy'. Best viewed in color.


6.8 Immunization under time-varying graphs

We now tackle the problem of immunization under a time-varying network. Consider
a setting with clearly different behaviors, say day/night, each characterized by a
corresponding adjacency matrix ($A_1$ for day, $A_2$ for night): what are the best nodes to
immunize to prevent an epidemic as much as possible? We restrict our attention to the
SIS model only for this problem. More generally, the problem we are tackling can be
formally stated as follows:

Problem 6.4. Dynamic Immunization

Given: (1) T alternating behaviors, characterized by a set of T graphs $\mathcal{A} = \{A_1, A_2, \ldots, A_T\}$;
(2) the SIS model with virus parameters $\beta$ and $\delta$; and (3) k vaccines;

Find: The best-k nodes for immunization.

6.8.1  Quality  Metric 

Using our results in Chapter 3, we can evaluate the quality of any immunization policy.
Recall that there is no epidemic if the largest eigenvalue $\lambda_S$ of the corresponding 'system
matrix' S is below one in magnitude. Hence, the smaller the value of $\lambda_S$, the lower the chances of
the virus causing an epidemic. Put differently, while immunizing, we want to decrease
the $\lambda_S$ value of the system as much as possible. Thus, the efficacy of any immunization
policy should be measured using the amount of "drop" in $\lambda_S$ it causes and the resulting
$\lambda_S$ after immunization.

6.8.2  Proposed  immunization  policies 

We now discuss some new immunization policies for time-varying graphs, partially
motivated by known policies used for static graphs. Again, for ease of exposition we
focus our attention only on the $\{A_1, A_2\}$ system of Section 3.3. From the above, it is clear
that optimally we should choose the set of k nodes which results in the smallest $\lambda_S$ value
possible after immunization. This implies that for each set of k nodes we test, we need to
delete k rows/columns from both $A_1$ and $A_2$ to get new matrices $A_1'$ and $A_2'$ and then
compute the new $\lambda_S$ value. The number of k-sets is $\binom{n}{k}$, and therefore this method is
combinatorial in nature and will be intractable even for small graphs. Nevertheless, we
call this strategy Optimal and show experimental results for this policy too, because this
policy gives us the lower bound on the $\lambda_S$ that can be achieved after k immunizations.

We want policies which are practical for large graphs and at the same time efficient
in lowering the $\lambda_S$ value of the system, i.e., they should be close to Optimal. To
this effect, we now present several greedy policies as well. As the heuristics are greedy
in nature, we only describe how to pick the best one node for immunization from a given
set of $G_1$ and $G_2$ graphs. Our proposed policies are:

Greedy-DmaxA  (Highest  degree  on  Ai  or  A2  matrices)  Under  this  policy,  at  each  step 
we  select  the  node  with  the  highest  degree  considering  both  the  A!  or  A2  adjacency 
matrices.  This  is  motivated  by  the  degree  immunization  strategy  used  for  static 
graphs. 

Greedy-DavgA (Highest degree on the "average" matrix) We select the node with the highest degree in the $A_{avg}$ matrix, where $A_{avg} = (A_1 + A_2)/2$.

Greedy-AavgA (Acquaintance immunization [CHbA03] on the average matrix) The "acquaintance" immunization policy works by picking a random person, and then immunizing one of that person's neighbors at random (who will probably be a 'hub'). We run this policy on the $A_{avg}$ matrix.

Greedy-S (Greedy on the system-matrix) This is the greedy strategy of immunizing, at each step, the node which causes the largest drop in the $\lambda_S$ value. Note that even this can be expensive for large graphs, as we have to evaluate the first eigenvalue of S after deleting each candidate node to decide which node is the best.

Optimal Finally, this is the optimal though combinatorial strategy mentioned above of finding the best set of k nodes, i.e., the set whose removal decreases $\lambda_S$ the most.
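
The two most effective greedy policies can be sketched in a few lines. The following is a minimal illustration, not the code used in the thesis experiments (which used Matlab and Python 2.6): it assumes small dense matrices stored as lists of lists, builds the system matrix as $S = ((1-\delta)I + \beta A_1)((1-\delta)I + \beta A_2)$ as in Lemma 6.4, and estimates $\lambda_S$ by plain power iteration. All function names and the toy day/night graphs are hypothetical.

```python
def undirected(n, edges):
    """Symmetric 0/1 adjacency matrix from an edge list."""
    A = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        A[u][v] = A[v][u] = 1.0
    return A

def largest_eigenvalue(M, iters=1000):
    """Power iteration; assumes M is non-negative with a positive diagonal."""
    n = len(M)
    if n == 0:
        return 0.0
    v = [1.0] * n
    lam = 0.0
    for _ in range(iters):
        w = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = max(w)
        v = [x / lam for x in w]
    return lam

def system_matrix(A1, A2, beta, delta):
    """S = ((1 - delta) I + beta A1) x ((1 - delta) I + beta A2)."""
    n = len(A1)
    def step(A):
        return [[(1 - delta) * (i == j) + beta * A[i][j] for j in range(n)]
                for i in range(n)]
    S1, S2 = step(A1), step(A2)
    return [[sum(S1[i][k] * S2[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def lambda_s(A1, A2, removed, beta, delta):
    """lambda_S after deleting the rows/columns of the immunized nodes."""
    keep = [i for i in range(len(A1)) if i not in removed]
    sub = lambda A: [[A[i][j] for j in keep] for i in keep]
    return largest_eigenvalue(system_matrix(sub(A1), sub(A2), beta, delta))

def greedy_s(A1, A2, k, beta, delta):
    """Greedy-S: repeatedly immunize the node giving the largest drop in lambda_S."""
    removed = set()
    for _ in range(k):
        candidates = [i for i in range(len(A1)) if i not in removed]
        best = min(candidates,
                   key=lambda i: lambda_s(A1, A2, removed | {i}, beta, delta))
        removed.add(best)
    return removed

def greedy_davg(A1, A2, k):
    """Greedy-DavgA: repeatedly immunize the highest-degree node of (A1 + A2) / 2."""
    n, removed = len(A1), set()
    for _ in range(k):
        def deg(i):
            return sum((A1[i][j] + A2[i][j]) / 2
                       for j in range(n) if j not in removed)
        removed.add(max((i for i in range(n) if i not in removed), key=deg))
    return removed

# Toy example (hypothetical data): a "day" star centered at node 0, a sparse "night" graph.
A1 = undirected(5, [(0, 1), (0, 2), (0, 3), (0, 4)])
A2 = undirected(5, [(0, 1), (2, 3)])
print(greedy_davg(A1, A2, 1), greedy_s(A1, A2, 1, beta=0.5, delta=0.1))
```

Greedy-S needs one eigenvalue computation per candidate node per step, whereas Greedy-DavgA only needs degree counts, which is the cost difference noted in the policy descriptions.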

6.8.3  Experimental  Setup 

We conducted a series of experiments using the MIT Reality Mining data set [EPL09]. The Reality Mining data consists of 104 mobile devices (cellular phones) monitored over a period of nine months (September 2004 - June 2005). If another participating Bluetooth device was within a range of approximately 5-10 meters, the date and time of the contact and the device's MAC address were recorded. Bluetooth scans were conducted at 5-minute intervals and aggregated into two adjacency matrices, one per 12-hour period (day and night). The epidemic simulations were run by alternating the day and night matrices over the period of simulation. All experiments were run on a 64-bit, quad-core (2.5 GHz each) server running a CentOS Linux distribution with 72 GB of shared RAM. Simulations were conducted using a combination of Matlab 7.9 and Python 2.6.

6.8.4  Results


Figure 6.9: Experiments on Reality Mining graphs: $\lambda_S$ vs. k for different immunization policies. Lower is better. Greedy-DavgA clearly drops the $\lambda_S$ value aggressively and is close to the Greedy-Opt.

Figure 6.9 shows the $\lambda_S$ value after immunizing k = 1, 2, ..., 10 nodes using each of the policies outlined above. As Optimal and Greedy-S require $\beta$ and $\delta$ as inputs, we set $\beta = 0.5$, $\delta = 0.1$. Running Optimal became prohibitively expensive (> 4 hours on the MIT Reality Mining graphs) after k = 7; hence we do not show data points for $k \geq 8$ for Optimal.

Moving on to the greedy strategies, we find that Greedy-S performs the best after k = 10, dropping the $\lambda_S$ value as aggressively as possible and matching Optimal at many values of k. We find that Greedy-DavgA also performs very well. Intuitively, this is because the highest-degree node in $A_{avg}$ is very well-connected and hence has a huge effect in reducing the $\lambda_{A_{avg}}$ value (we discuss $A_{avg}$ further in Section 6.8.5). At the same time, Greedy-DmaxA is comparable to Greedy-DavgA, as it picks the highest degree across both graphs, and this node will mostly also have the highest mean degree. Finally, Greedy-AavgA drops the $\lambda_S$ value the least among all the policies. Given the first random choice of node, Greedy-AavgA can be "trapped" in the neighborhood of a node far from the best node to immunize, and thus can be forced to make a choice based on limited local information.

Figure 6.10 demonstrates the effectiveness of our quality metric, i.e., the $\lambda_S$ value, for each immunization policy after k = 10 immunizations. It is a scatter plot of the maximum number of infections till steady state (the footprint) against the $\lambda_S$ value at the end of the immunizations. Points closer to the origin (minimum footprint and $\lambda_S$) therefore represent better policies. Optimal should theoretically be the closest to the origin (we do not show it as it did not finish). Also, as discussed before, Greedy-AavgA is the worst, and that is demonstrated by its point. From Figure 6.9 we can see that Greedy-S has the least $\lambda_S$ value after k = 10; accordingly, it is closest to the origin and has the smallest footprint. The other policies perform well too, as their final $\lambda_S$ values were close as well.

Figure 6.10: Scatter plot of Max. infections till steady state and $\lambda_S$ for different immunization policies after k = 10 immunizations. Points closer to the origin are better. All policies perform in accordance with the $\lambda_S$ values achieved (see Figure 6.9).

To summarize, our experiments demonstrated that the policies which decrease $\lambda_S$ the most are the best policies, as they result in smaller footprints as well. Also, simple greedy policies were effective and achieved $\lambda_S$ values similar to those of expensive combinatorial policies.

6.8.5  Discussion 

We  discuss  some  pertinent  issues  and  give  additional  explanations  in  this  section. 

Goodness of the $A_{avg}$ matrix: We saw that Greedy-DavgA gave very good results and was close to Greedy-S and Optimal. This can be explained with the help of the following lemma.

Lemma 6.4. $(1 - 2\delta)I + 2\beta A_{avg}$ is a first-order approximation of the S matrix.

Proof. Note that (T = 2),

\begin{align*}
S &= S_1 \times S_2 \\
  &= ((1 - \delta)I + \beta A_1) \times ((1 - \delta)I + \beta A_2) \\
  &= (1 - \delta)^2 I + (1 - \delta)\beta (A_1 + A_2) + \beta^2 A_1 A_2 \\
  &\approx (1 - 2\delta)I + \beta (A_1 + A_2) = (1 - 2\delta)I + 2\beta \, \frac{A_1 + A_2}{2}
\end{align*}

where we neglected second-order terms involving $\beta$ and $\delta$. Thus $(1 - 2\delta)I + 2\beta A_{avg}$ is a first-order approximation of the S matrix. □
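
Lemma 6.4 can be checked numerically on a toy instance. The sketch below is an illustration under stated assumptions: the two 4-node graphs and the values $\beta = \delta = 0.05$ are hypothetical (chosen small so that second-order terms are negligible), and the largest eigenvalue is estimated by simple power iteration, which is adequate here because both matrices are non-negative with a positive diagonal.

```python
def undirected(n, edges):
    """Symmetric 0/1 adjacency matrix from an edge list."""
    A = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        A[u][v] = A[v][u] = 1.0
    return A

def largest_eigenvalue(M, iters=5000):
    """Power iteration; assumes M is non-negative with a positive diagonal."""
    n = len(M)
    v = [1.0] * n
    lam = 0.0
    for _ in range(iters):
        w = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = max(w)
        v = [x / lam for x in w]
    return lam

n = 4
beta, delta = 0.05, 0.05                      # hypothetical small parameters
A1 = undirected(n, [(0, 1), (1, 2), (2, 3)])  # hypothetical "day" graph
A2 = undirected(n, [(0, 2), (1, 3)])          # hypothetical "night" graph

def step(A):
    """Per-period matrix (1 - delta) I + beta A."""
    return [[(1 - delta) * (i == j) + beta * A[i][j] for j in range(n)]
            for i in range(n)]

S1, S2 = step(A1), step(A2)
S = [[sum(S1[i][k] * S2[k][j] for k in range(n)) for j in range(n)]
     for i in range(n)]

# First-order approximation from Lemma 6.4: (1 - 2 delta) I + 2 beta A_avg
approx = [[(1 - 2 * delta) * (i == j) + 2 * beta * (A1[i][j] + A2[i][j]) / 2
           for j in range(n)] for i in range(n)]

lam_S = largest_eigenvalue(S)
lam_approx = largest_eigenvalue(approx)
print(lam_S, lam_approx, abs(lam_S - lam_approx))
```

The gap between the two eigenvalues is of second order in $\beta$ and $\delta$, so it grows as the parameters move away from zero (for instance at the experimental setting $\beta = 0.5$, $\delta = 0.1$).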

In other words, we can consider the time-varying system to be approximated by a static graph system with the $A_{avg}$ graph adjacency matrix and a virus of the same strength ($\beta/\delta$ remains the same). The threshold quantity for a static graph with adjacency matrix A is $\beta \lambda_A / \delta$. So in our static case, we should aim to reduce $\lambda_{A_{avg}}$ as much as possible. Any policy which aims to reduce the $\lambda_{A_{avg}}$ value will approximate our original goal of dropping the $\lambda_S$ value. Thus, this gives a theoretical justification of why Greedy-DavgA gave good results.

Temporal Immunization: In this section, we concentrated only on static immunization policies, i.e., policies where once immunized, a node is 'removed' from the contact graphs. While this makes sense for biological vaccinations, in a more complex 'resource'-oriented scenario where each 'glove' protects a person only for the time it is worn, a time-varying immunization policy might be more useful, e.g., we may have a finite number of gloves to give and we can change the assignment of gloves between day and night. In this case, it may be better to immunize nurses in hospitals during the day by giving them the gloves, but during the night we can decide to give gloves to restaurant waiters or children, because the nurses are then not well-connected in the contact graph. Our threshold results can readily estimate the impact of any 'what-if' scenarios w.r.t. such temporal immunization algorithms.


6.9  Conclusion 

In this chapter, we studied the problem of immunization (node removal), both on graphs which are fixed and on graphs which change with time. Simply put, we asked the question: "what are the best k nodes to immunize to defend against an epidemic on static and time-varying graphs?". This problem depends closely on the epidemic threshold problem which we addressed in the previous chapters. Besides the problem definitions, our main contributions are:

1. A novel definition of the 'Shield-value' score Sv(S) for a set of nodes S in a fixed graph, obtained by carefully using results from the theory of matrix perturbation (Sv(S) is a good approximation to the eigen-drop $\Delta\lambda$, the reduction of 'Vulnerability'; see Lemma 6.1).

2. For the static case, we gave a near-optimal and scalable algorithm (NetShield) to find a set of nodes with the highest 'Shield-value' score, by showing that our setting has the sub-modularity property (i.e., Theorem 6.1). Moreover, NetShield also scales linearly with the size of the graph (number of edges). We also developed several heuristics for time-varying graphs.

3.  Extensive  experiments  on  several  real  data  sets,  illustrating  both  the  effectiveness 
as  well  as  the  efficiency  of  our  methods,  sometimes  outperforming  competitors  by 
several  orders  of  magnitude. 

A promising research direction is to parallelize the current methods (e.g., using Hadoop^8).


^8 http://hadoop.apache.org/

Chapter  7

Fractional  Immunization 


In  the  previous  chapter,  we  studied  the  problem  of  immunization  as  completely  removing 
nodes  from  the  network.  While  preventing  contagion  in  networks  is  an  important 
problem  in  public  health  and  other  domains,  the  assumption  that  selected  nodes  can 
be  rendered  completely  immune  does  not  hold  for  infections  for  which  there  is  no 
vaccination  or  effective  treatment.  Instead,  one  can  confer  fractional  immunity  to  some 
nodes by allocating variable amounts of infection-prevention resource to them. We formulate the problem of distributing a fixed amount of resource across nodes in a network such that the infection rate is minimized, prove that it is NP-complete, and derive a highly effective and efficient linear-time algorithm. We demonstrate the efficiency and accuracy of our algorithm compared to several other methods using simulation on real-world network datasets including US-MEDICARE and state-level interhospital patient transfer data. We find that concentrating resources at a small subset of nodes using our
algorithm  is  up  to  6  times  more  effective  than  distributing  them  uniformly  (as  is  current 
practice)  or  using  network-based  heuristics.  To  the  best  of  our  knowledge,  we  are  the 
first  to  formulate  the  problem,  use  truly  nation-scale  network  data  and  propose  effective 
algorithms. 


7.1  Introduction 

Given a graph and vaccines which provide partial ('fractional') protection, how should we distribute them to maximize lives saved? Networks carry harmful agents, e.g., disease, computer viruses, and even misinformation. The network's structure dictates how rapidly the malicious agent will spread. One can take advantage of this structure to identify specific nodes for infection control, such that the spread of the disease is significantly curtailed. In selecting nodes for infection control, previous work has assumed that nodes can be rendered completely immune. However, in many cases the complete immunization of a node is not an option. Bacteria present in hospitals have developed resistance to most antibiotics. Vaccines take time to be developed for both human and computer viruses, prompting other measures to prevent epidemics. However, one can provide partial ('fractional') protection by allocating resources to render nodes less susceptible.

Figure 7.1: Our proposed SMART-ALLOC method minimizes the number of infections (red circles): (a) The US-MEDICARE network overlaid on a map; (b-d) infected hospitals after a year (365 days) under different immunization algorithms (panel (c): Degree; panel (d): Smart-Alloc). The same amount of resources (k = 200) was distributed by the algorithms. UNIFORM is largely the current practice of distributing evenly across all hospitals, while DEGREE distributes according to the number of connections of a hospital.

In this chapter we formulate the problem of distributing resources to minimize the spread of infection on a network. Previously devised models, which assume that allocating a single unit of resource to a node renders it completely immune, are a special case of this more general problem. We illustrate the problem in two settings: the spread of infection between hospitals through patient transfers, and the spread of malicious code between users in an electronic setting.

Consider the problem of preventing hospital-to-hospital transfer of drug-resistant bacteria. Critically ill patients are frequently and routinely transferred between hospitals in order to provide necessary specialized care [ICM+09]. While such interhospital transfers are an essential part of routine patient care, they also enable the transfer from hospital to hospital of highly virulent micro-organisms resistant to many or all antibiotics [DWG10, Liv03]. As an example, recent work [Mea12] implicates inter-hospital patient transfers as an important vehicle for the spread of the "superbug" MRSA (methicillin-resistant Staphylococcus aureus). There is no existing technology, short of isolating a hospital, that can completely prevent the spread of such micro-organisms. Disrupting transfers by removing a hospital from the system can only be done under truly extraordinary circumstances (such as the outbreak of SARS in Toronto). Instead, there are large numbers of infection control technologies (e.g., bottles of disinfectant) that offer partial prevention and can be applied at individual hospitals (e.g., [ZIM+09]). Since such infection control technologies are costly, how should policy-makers optimally deploy them in order to minimize the global interhospital spread of highly resistant micro-organisms via patient transfers?

We also consider the spread of malicious content in an electronic setting. In the Second Life virtual world, nearly all content, from the landscape, to clothing, to the avatars' movements, is created and distributed as scripts by the users themselves. This is part of the interactivity that has made the enterprise a success. However, these virtual environments create the potential for malicious scripts to be inadvertently picked up and dispersed by unwitting users.^1 Without shutting down a user's account, it is not possible to confer complete immunity upon that node. However, one could allocate resources differentially to a subset of nodes, in the form of educating users and auditing their code inventories.

Motivated  by  the  above  applications,  and  the  many  other  instances  where  complete 
immunization  is  not  feasible  (e.g.,  HIV  transmission,  or  H1N1  flu  transmission  prior 
to  availability  of  vaccine)  we  study  the  problem  of  effective  and  efficient  fractional 
immunization  on  directed  weighted  graphs.  In  fractional  immunization,  one  allocates 
differing  amounts  of  resource  to  nodes  from  a  fixed  total  budget.  Nodes  which  receive 
more  infection-prevention  resource  have  a  smaller  likelihood  of  becoming  infected  when 
exposed  than  nodes  receiving  no  or  little  resource.  A  straightforward  approach  that  tests 
each  possible  allocation  would  very  quickly  become  computationally  intractable  (e.g., 
for  a  network  with  2000  nodes,  it  will  take  more  time  than  the  age  of  the  universe  to 
examine  all  the  possibilities  on  a  2GHz  processor  machine).  Instead,  we  give  an  effective 
and  efficient  linear-time  (in  nodes  and  edges)  algorithm  Smart-Alloc  in  this  chapter. 
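
The scale of the brute-force search can be illustrated with a short back-of-the-envelope calculation. This is only a sketch under assumptions that are not stated in the text: k = 200 identical antidotes (the budget used in Figure 7.1), one candidate allocation scored per clock cycle on a 2 GHz processor, and an age of the universe of roughly 13.8 billion years.

```python
import math

N = 2000                                 # number of nodes, as in the text
k = 200                                  # assumption: budget borrowed from Figure 7.1
# Ways to hand k identical antidotes to N nodes (multisets of size k).
allocations = math.comb(N + k - 1, k)

evals_per_second = 2 * 10**9             # assumption: one allocation per 2 GHz clock cycle
age_of_universe_s = int(13.8e9 * 365.25 * 24 * 3600)  # assumption: ~13.8 billion years

seconds_needed = allocations // evals_per_second
print("decimal digits in the number of allocations:", len(str(allocations)))
print("exceeds the age of the universe:", seconds_needed > age_of_universe_s)
```

Even this optimistic estimate ignores the cost of simulating the epidemic for each allocation, which would make exhaustive search slower still.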

Our  extensive  experiments  show  that  we  may  achieve  significant  benefits  if  nodes 
coordinate  their  allocation  of  resources,  rather  than  making  allocations  independently, 
as  is  current  practice  in  many  settings.  See  Figure  7.1  for  an  illustration,  where  our 
algorithm  outperforms  other  alternatives  by  up  to  6x  fewer  infections. 

The rest of the chapter is organized as follows: § 7.2 reviews related work, § 7.3 gives the problem formulation and the hardness result, § 7.4 and § 7.5 explain our proposed method and § 7.6 presents extensive experiments on datasets. Finally, we conclude in § 7.7.

^1 http://www.reuters.com/article/2007/08/20/us-disease-game-idusn2031054020070820?sp=true


7.2  Related  Work 

In short, all the existing immunization strategies mentioned below assume: (1) full immunity - once a node is immunized, it is completely removed from the graph; (2) binary allocation (i.e., each node would need at most 1 antidote); and (3) symmetric immunization - once applied, an antidote affects both incoming and outgoing edges. These assumptions might be too strong for the inter-hospital patient transfer application. To the best of our knowledge, we are the first to address the more realistic and challenging setting, where the effect of an antidote could be partial and asymmetric and the same node can receive multiple antidotes.

We  review  related  work  in  the  context  of  networks  here,  which  can  be  categorized  into 
three  parts:  epidemic  thresholds,  immunization  algorithms  and  information  diffusion. 

Epidemic Thresholds/Conditions Much work has been done on finding epidemic thresholds (the minimum virulence of a virus which results in an epidemic) for a variety of networks [Bai75, McK25, AM91, KW93, PSV02, WCWF03, GMT05, PCF+11, MNP06]. It should be pointed out that, with the exception of [MNP06], most if not all of the existing work assumes symmetry in virus propagation. That is, the probability that an infected A infects B is the same as the probability that an infected B infects A. The inter-hospital patient transfer graph is asymmetric; a hospital that is better equipped to treat a critical care patient is more likely to be on the receiving end of a transfer. Asymmetries in transfer are also present in, e.g., email networks.

Immunization Cohen et al. [CHbA03] studied the acquaintance immunization policy, and showed that it is much better than random, for both the SIS as well as the SIR model on random power-law graphs. Hayashi et al. [HMM03] modeled e-mail viruses using the SHIR model (Susceptible, Hidden, Infectious, Recovered) and derived the extinction conditions under random and targeted immunization. Tong et al. [TPT+10] proposed an effective immunization strategy in the SIS model, also motivated by preventing the spread of computer viruses. Briesemeister et al. [BLP03] studied immunization policies on power-law graphs. Lappas et al. [LTGM10] studied the problem of finding the best-possible 'culprits' who started an infection.

General information diffusion processes There is a lot of research interest in studying dynamic processes on large graphs: (a) blogs and propagations [GGLNT04, KNRT03, KKT03, RD02], (b) information cascades [BHW92, GLM01, Gra78] and (c) marketing and product penetration [Rog03, LAH06]. These dynamic processes are all closely related to virus propagation. For example, one may wish to allocate third-party "fact checking" resources to content posted on specific blogs in order to minimize the spread of misinformation in the network. Although no blog could be completely immune to spreading misinformation, such efforts would dampen its spread.


7.3  Problem  Formulation  and  Hardness  result 

We first formulate the problem explicitly. Let A be the adjacency matrix of the connected directed weighted graph (of N nodes and E edges) on which the virus is spreading; entry A(i, j) in the matrix denotes the weight on the edge from node (hospital) i to node j (e.g., the weight can be the average number of patient-transfers per day). The infection spreading model can be described by a flu-like model with no immunity, technically the SI model ('susceptible-infected') of epidemiology [AM91]. Briefly, at every time-step, any healthy node can get the infection from one of its currently infected neighbors. The probability of becoming infected by any particular neighbor during a period of time is independent and proportional to the weight of the connecting edge. Also, once infected, a node always stays infected. Any node gets partial immunity upon getting an antidote. Any amount x of antidote cuts the transmissibility of the virus by a factor f(x) (called the utility function). For example, under the function $f(x) = 0.5^x$, each additional antidote given to hospital i decreases the probability of transmission from any neighbor j of i by a fixed percentage (50%). Our results hold for any utility function f(x) with a diminishing marginal returns property typical of infection-control techniques (cf. [ZIM+09]). Also note the inherently asymmetric nature of the effect of an antidote: it only affects the incoming edges of a node. The infection starts with some initially infected 'seed' nodes. We want to distribute the antidotes so that the expected "footprint" (the expected number of infections at some given time t) is minimized. To summarize, we are given:

•  The  SI  model  as  the  virus-spreading  process 

•  A fixed directed weighted graph A (each edge e having weight $w_e$ with $0 < w_e \leq 1$; e.g., the weight can be the average number of patient-transfers per day between hospitals)

•  A  total  of  k  antidotes  having  partial  effect  e.g.,  bottles  of  disinfectant 

•  A  weakening  function  f  (x),  denoting  how  beneficial  are  x  units  of  antidote,  typically 
with  diminishing  marginal  returns  property 

Using popular epidemiological assumptions, we assume that the virus and the underlying graph do not change.^2 The problem can be stated as:

Problem 7.1 (MAX-HEALTH). Distribute the antidotes such that for an infection process spreading over the resulting graph after the antidote-allocation, we minimize the "footprint" (the expected number of infections at time t, for some given t).

The current practice in allocating antidote across a network is effectively uniform, e.g., hospitals independently tackle infection control. However, this makes no use of the connected network we are given. As mentioned in the introduction, another obvious but computationally prohibitive method is to estimate the footprint of each candidate allocation through computer simulations. How can we get a practical and effective algorithm?

^2 Relaxing these assumptions is a promising research direction.

7.3.1  Our  proposed  problem — MIN-CONN 

Main Idea Our observation is that the footprint depends on the connectivity of the underlying network and, as we show next, the best single measure of connectivity is $\lambda_A$, the largest eigenvalue of the adjacency matrix of the network. Roughly, it describes the number of paths between pairs of nodes in a graph, discounting for longer paths. Earlier results [WCWF03, GMT05, PCF+11] have also shown that the epidemic threshold (the maximum virus strength for which there is no epidemic) on unweighted, undirected graphs depends on the largest eigenvalue of the adjacency matrix. Instead of MAX-HEALTH, we then propose to minimize the largest eigenvalue of the weighted adjacency matrix while distributing the antidotes.

Justification  Unfortunately,  note  that  unlike  other  models,  our  virus  spreading  model 
is  SI  and  thus  has  no  epidemic  threshold  -  any  initial  infection  will  eventually  infect 
everyone  in  the  graph.  But  still,  as  our  next  lemma  shows,  we  can  upper-bound  the 
expected  number  of  infected  nodes  in  the  graph  at  any  time  t: 

Lemma 7.1. In the SI virus spreading model on a graph:

\[ \sigma(t) \leq (1 + \lambda_A)^t \, \sigma(0) \]

where $\sigma(t)$ is the expected number of infected nodes at time t > 0 and $\sigma(0)$ is a scalar depending just on the initial conditions (independent of t).

Proof. Suppose the discrete-time SI process is running on graph A and $p_i(t)$ denotes the probability that node i is infected at time t after the process started. Then,

\[ p_i(t+1) = p_i(t) + (1 - p_i(t)) \cdot r_i \tag{7.1} \]

where $r_i$ is the probability that node i receives some infection from any of its infected neighbors during the time t to t + 1. Let R be an indicator random variable for the event that node i gets the infection during t to t + 1. Clearly,

\[ R = 1_{\bigcup_{j \in \mathrm{neighbor}(i)} T_j} \]

where $T_j$ is the event that node j transferred an infection to node i between time t and t + 1, and $1_{T_j}(t)$ is the corresponding indicator random variable. Using the well-known relation that the expectation of an indicator random variable is just the probability of the corresponding event:

\[ r_i = E[R] = E\Big[1_{\bigcup_{j \in \mathrm{neighbor}(i)} T_j}\Big] \leq \sum_{j=1}^{N} E[1_{T_j}(t)] = \sum_{j=1}^{N} A(j,i) \, p_j(t) \]

where the second step follows because for any two events A and B, $1_{A \cup B} = 1_A + 1_B - 1_A 1_B \Rightarrow E[1_{A \cup B}] \leq E[1_A] + E[1_B]$. Thus, using Equation 7.1 and the above:

\[ p_i(t+1) \leq p_i(t) + (1 - p_i(t)) \sum_{j=1}^{N} A(j,i) \, p_j(t) \]

Letting $P(t) = [p_1(t), p_2(t), \ldots, p_N(t)]^T$, we can write the entire system as:

\begin{align*}
P(t+1) &\leq P(t) + [I - \mathrm{diag}(P(t))] \times A^T \times P(t) \\
       &= P(t) + A^T P(t) - \mathrm{diag}(P(t)) A^T P(t) \\
       &\leq (I + A^T) P(t)
\end{align*}

and hence, applying this inequality repeatedly, $P(t) \leq (I + A^T)^t P(0)$. Consider the all-ones vector e. Then for any t > 0, $e^T P(t) = \sigma(t)$, the expected number of infected nodes at time t. Hence,

\begin{align*}
\sigma(t) &\leq e^T (I + A^T)^t P(0) \\
          &= e^T \Big( \sum_{i=1}^{N} (1 + \lambda_{A,i})^t \, v_i u_i^T \Big) P(0) \\
          &\leq (1 + \lambda_{A,1})^t \, e^T \Big( \sum_{i=1}^{N} v_i u_i^T \Big) P(0)
\end{align*}

where we used the spectral decomposition of the matrix $I + A^T$ in the second step. Denoting $\lambda_{A,1}$ as $\lambda_A$, we have that

\[ \sigma(t) \leq (1 + \lambda_A)^t \, \sigma(0) \]

where $\sigma(0) = e^T \big( \sum_{i=1}^{N} v_i u_i^T \big) P(0)$ (a scalar depending just on the initial conditions, independent of t). □
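
The key step of the proof, namely that the per-node recursion is dominated by the linear system $(I + A^T)^t P(0)$, can be illustrated numerically. The sketch below is not a simulation of the SI process itself: as an assumption made for illustration, it iterates Equation 7.1 with $r_i$ replaced by its union bound (capped at 1 so that it stays a probability) and compares the result against the linear bound. The function name and the small weighted graph are hypothetical.

```python
def si_bounds(A, p0, steps):
    """Return [(sum of p, sum of q)] per step, where
    p follows p_i <- p_i + (1 - p_i) * min(1, sum_j A[j][i] p_j)   (Eq. 7.1 with union bound)
    q follows q   <- (I + A^T) q                                   (linear upper bound)."""
    n = len(A)
    p, q = list(p0), list(p0)
    history = []
    for _ in range(steps):
        r = [min(1.0, sum(A[j][i] * p[j] for j in range(n))) for i in range(n)]
        p = [p[i] + (1 - p[i]) * r[i] for i in range(n)]
        q = [q[i] + sum(A[j][i] * q[j] for j in range(n)) for i in range(n)]
        history.append((sum(p), sum(q)))
    return history

# Hypothetical directed weighted graph; A[j][i] is the weight of edge j -> i.
A = [[0.0, 0.3, 0.1, 0.0],
     [0.0, 0.0, 0.2, 0.0],
     [0.0, 0.0, 0.0, 0.4],
     [0.1, 0.0, 0.0, 0.0]]
p0 = [1.0, 0.0, 0.0, 0.0]   # node 0 is the initially infected seed

history = si_bounds(A, p0, steps=10)
for t, (expected, bound) in enumerate(history, start=1):
    print(t, round(expected, 4), round(bound, 4))
```

The first column of sums can never exceed the number of nodes, while the linear bound keeps growing at a rate governed by $1 + \lambda_A$; lowering $\lambda_A$ therefore tightens the bound at every time t.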

Thus, we propose to minimize the upper bound on the expected number of infected nodes at any time t, by minimizing the largest eigenvalue $\lambda_A$.

We call our proposed problem MIN-CONN. Suppose the vector which gives us the distribution for k antidotes is $m = \{m_1, m_2, \ldots, m_N\}$ (where $m_i$ is the number of antidotes given to node i), with the constraint that $\sum_i m_i = k$. Denoting by A' the resulting adjacency matrix after distributing the antidotes, our transformed problem can be stated as:

Problem 7.2 (MIN-CONN). Distribute the antidotes such that the largest eigenvalue of the resulting adjacency matrix is minimized, i.e.,

\[ \text{minimize } \lambda_{A'} \quad \text{s.t.} \quad \sum_i m_i = k, \quad \forall i \;\; m_i \in \{0, 1, \ldots\} \]

It is easy to see that if we define a matrix $F = \mathrm{diag}(f(m))$,^3 then $A' = A \times F$.

^3 f(m) just applies the function f to each element of the vector m.
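
The effect of an allocation on the matrix can be made concrete. The sketch below is an illustration, not the thesis implementation: it assumes the example utility function $f(x) = 0.5^x$ from the problem formulation, scales column j of A by $f(m_j)$ (which is what right-multiplication by the diagonal matrix F does, i.e., it weakens the incoming edges of node j), and estimates the largest eigenvalue by power iteration on I + A'. The helper names and the 3-node directed cycle are hypothetical.

```python
def f(x):
    """Example utility function from the text: each antidote halves transmissibility."""
    return 0.5 ** x

def apply_antidotes(A, m, f=f):
    """A' = A x diag(f(m)): column j (incoming edges of node j) is scaled by f(m[j])."""
    n = len(A)
    return [[A[i][j] * f(m[j]) for j in range(n)] for i in range(n)]

def largest_eigenvalue(A, iters=2000):
    """Largest eigenvalue of a non-negative matrix via power iteration on I + A
    (the shift avoids oscillation on periodic graphs such as directed cycles)."""
    n = len(A)
    v = [1.0] * n
    lam = 1.0
    for _ in range(iters):
        w = [v[i] + sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = max(w)
        v = [x / lam for x in w]
    return lam - 1.0

# Hypothetical directed cycle 0 -> 1 -> 2 -> 0 with weight 0.5 on every edge.
A = [[0.0, 0.5, 0.0],
     [0.0, 0.0, 0.5],
     [0.5, 0.0, 0.0]]
m = [1, 0, 0]                       # one antidote, given to node 0

A_prime = apply_antidotes(A, m)
print(largest_eigenvalue(A), largest_eigenvalue(A_prime))
```

For a weighted directed cycle the largest eigenvalue is the geometric mean of the edge weights, so halving the single edge into node 0 lowers it from 0.5 to $(0.5 \cdot 0.5 \cdot 0.25)^{1/3}$.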
7.3.2  MIN-CONN  is  NP-complete

Unfortunately, MIN-CONN is NP-complete. Consider the decision version of MIN-CONN:

Problem 7.3 (MIN-CONN Decision Version). Given a directed and weighted graph G = (V, E), $k \geq 0$, $t \geq 0$, and non-increasing f(x) (hence, instance (G, k, t, f(x))), is there an assignment m with $\sum_i m_i = k$, $\forall i \; m_i \in \mathbb{Z}^*$ such that $\lambda_{AF} \leq t$, where A is the adjacency matrix of G and $F = \mathrm{diag}(f(m))$?

We  will  prove  MIN-CONN  (Decision  version)  is  NP-complete  next. 

Theorem  7.1.  MIN-CONN  (Decision  Version)  is  NP-complete. 

Proof. Clearly, MIN-CONN (Decision Version) is in NP: given an integral assignment m as witness, we can check in poly-time if the largest eigenvalue is at most the threshold. Hence we just need to prove that an NP-complete problem is poly-time reducible to it. We reduce from INDEPENDENT-SET, a well-known NP-complete problem [GJ83].

Problem (INDEPENDENT-SET). Given an undirected, unweighted graph G = (V, E) and a number k > 0 (i.e., instance (G, k)), is there a set of k vertices, no two of which are adjacent?

Say the size of G is n. Given an instance (G, k) of INDEPENDENT-SET, we create an instance I = (G, n − k, 0, f(x)) of MIN-CONN, where f(x) is defined as

\[ f(x) = \begin{cases} 1 & \text{if } x = 0 \\ 0 & \text{if } x > 0 \end{cases} \]

Note that such an f(x) forces any algorithm for MIN-CONN to essentially choose n − k vertices all of whose incoming edges will be deleted. Clearly this construction takes polynomial time. We now need to prove two things:

1. If there is an independent set S of size k in G, the instance I has a YES answer.

This is true, because we can set $m_i = 1$ for all n − k nodes i not in S (i.e., V \ S). Consider the resulting graph G'. There will not be any edges from vertices in S to any other vertex. Also, there will not be any edges from vertices in the set V \ S to each other. These follow because of the antidote distribution and the fact that S was an independent set for G. Hence, the adjacency matrix AF of G' will look like:

\[ AF = \begin{bmatrix} 0_{n-k,\,n-k} & C \\ 0_{k,\,n-k} & 0_{k,\,k} \end{bmatrix} \]

where C is a matrix of size (n − k) × k representing the edges from V \ S to S. It is easy to check that the largest eigenvalue of AF is 0. Hence I has a YES answer.

2. If G does not have an independent set of size k, then instance I has a NO answer.

Suppose the algorithm for MIN-CONN selects n − k vertices all of whose incoming edges will be deleted. Call the set of un-selected vertices S (|S| = k) and the resulting graph G' (adjacency matrix AF). Consider $G_S$ and $G'_S$, the subgraphs induced by the vertices of S in G and G' respectively. Clearly $G_S = G'_S$, as the algorithm did not select any vertex in S. Also, as G does not have an independent set of size k, $G_S$ is not a null graph (with no edges) and thus has some connected sub-graph H with at least one edge. Applying the Perron-Frobenius theorem [McC00], the largest eigenvalue of the adjacency matrix of H is positive. Denoting the adjacency matrix of $G'_S$ (or $G_S$) as D, the matrix AF will look like:

\[ AF = \begin{bmatrix} 0_{n-k,\,n-k} & C \\ 0_{k,\,n-k} & D \end{bmatrix} \]

where, like before, C is a matrix of size (n − k) × k representing the edges from V \ S to S. We know that the largest eigenvalue of AF is at least the largest eigenvalue of D, and the largest eigenvalue of D is at least the largest eigenvalue of the adjacency matrix of H (eigenvalue interlacing). Hence, D has at least one non-zero eigenvalue. Thus for any S, the largest eigenvalue of AF is non-zero and hence instance I has a NO answer.

Hence,  MIN-CONN  (Decision  version)  is  NP-complete.  □ 
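The reduction can be checked by brute force on tiny graphs. The sketch below is only an illustration under stated assumptions: it takes f(0) = 1 and f(x) = 0 for x > 0 (so an antidote deletes all incoming edges of a node), stores the graph as a 0/1 list-of-lists with `adj[i][j] = 1` for an edge i → j, and uses the fact that a matrix has largest eigenvalue 0 exactly when it is nilpotent. The function names are hypothetical and not part of the original text.

```python
from itertools import combinations

def delete_incoming(adj, selected):
    """Zero the column (all incoming edges) of every node that received an antidote."""
    n = len(adj)
    return [[0 if j in selected else adj[i][j] for j in range(n)] for i in range(n)]

def largest_eigenvalue_is_zero(mat):
    """The spectral radius is 0 exactly when the matrix is nilpotent, i.e. mat^n == 0."""
    n = len(mat)
    power = [row[:] for row in mat]
    for _ in range(n - 1):
        power = [[sum(power[i][t] * mat[t][j] for t in range(n)) for j in range(n)]
                 for i in range(n)]
    return all(x == 0 for row in power for x in row)

def min_conn_instance_is_yes(adj, k):
    """Try every set of n - k antidote nodes (exponential; tiny graphs only).

    Stacking several antidotes on one node cannot help here, so distinct nodes suffice.
    """
    n = len(adj)
    return any(largest_eigenvalue_is_zero(delete_incoming(adj, set(sel)))
               for sel in combinations(range(n), n - k))

# Path 0-1-2-3 has the independent set {0, 2}; a triangle has none of size 2.
path = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
print(min_conn_instance_is_yes(path, 2), min_conn_instance_is_yes(triangle, 2))
```

On these two graphs the MIN-CONN instance is a YES instance exactly when an independent set of size k exists, matching the two directions of the proof.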


7.4  Proposed  Method — Overview 

As MIN-CONN is NP-complete, we resort to heuristics. A simple and intuitive heuristic
is to disrupt the connectivity of the network by distributing the antidotes according to the
number of neighbors ('degree') of a hospital. Thus a hospital involved in a larger number
of total patient transfers will receive more resources than small isolated hospitals. This
appears to be a reasonable approach until we realize that it does not directly attack
the exact connectivity metric: λ_A. For example, this method will allocate most of the
resources to the big coastal hospitals, and may miss out on a critical but mid-sized central
hospital acting as a 'bridge' between the coasts. Hence, our heuristic should directly try
to optimize the drop in λ_A. Next we present two such heuristics, in order of increasing speed: (a)
Exhaustive, (b) Smart-Alloc.

7.4.1  Algorithm  Exhaustive 

Algorithm EXHAUSTIVE greedily tries to find the best hospital to allocate each additional
antidote to. Clearly, the best node is the one in the graph which, when given the extra
antidote, decreases λ_A the most at that step. Hence, we need to compute the largest
eigenvalue N times to make a single allocation decision (so for k antidotes, it will
be done k × N times). This is very expensive: e.g., for a US-wide network of about 2000
hospitals, it took about a day to distribute only 1500 antidotes. The total running time
would be O(kNE) (using the O(#edges) Lanczos algorithm for computing the eigenvalue).
For larger graphs (such as our Second-Life network), this would be too slow to be
feasible.
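The loop structure of EXHAUSTIVE can be sketched as follows. This is a minimal illustration, not the thesis implementation (which was written in C++ and used Lanczos): it assumes the immunized matrix is A F with F = diag(f(m)), stores `A[i][j]` as the weight of edge i → j, and swaps in a plain power iteration on A + I as the eigenvalue solver. The names `largest_eigenvalue`, `immunized` and `exhaustive` are hypothetical.

```python
def largest_eigenvalue(A, iters=300):
    """Leading eigenvalue of a non-negative irreducible matrix (power iteration on A + I)."""
    n = len(A)
    x = [1.0 / n] * n
    lam = 0.0
    for _ in range(iters):
        y = [x[i] + sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        s = sum(y)                      # growth factor of A + I, because sum(x) == 1
        lam, x = s - 1.0, [val / s for val in y]
    return lam

def immunized(A, m, f):
    """Return A F with F = diag(f(m)): the incoming edges of node j are scaled by f(m[j])."""
    n = len(A)
    return [[A[i][j] * f(m[j]) for j in range(n)] for i in range(n)]

def exhaustive(A, k, f):
    """Give each of the k antidotes to the node whose extra antidote lowers the eigenvalue most."""
    n = len(A)
    m = [0] * n
    for _ in range(k):
        best_node, best_lam = 0, float("inf")
        for i in range(n):              # N eigenvalue computations per antidote
            m[i] += 1
            lam = largest_eigenvalue(immunized(A, m, f))
            m[i] -= 1
            if lam < best_lam:
                best_node, best_lam = i, lam
        m[best_node] += 1
    return m

def f(x):
    return 0.5 ** x                     # decay used in the experiments

# Toy graph: hub 0 linked to nodes 1..4, plus an extra edge between 1 and 2.
A = [[0, 1, 1, 1, 1], [1, 0, 1, 0, 0], [1, 1, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0]]
print(exhaustive(A, 3, f))
```

The inner loop makes the cost visible: every antidote triggers one eigenvalue computation per node, which is where the k × N factor comes from.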




7.4.2  Algorithm  Smart-Alloc 

We  give  an  overview  of  our  approach  here,  and  the  theoretical  under-pinnings  in  the 
next  section. 

Best single allocation. Following from the discussion above, we can instead give each
additional antidote to the currently most 'important' (central) hospital, with the hope that
it is also the hospital reducing λ_A the most. Fortunately, we can show that the measure
of centrality which allows us to closely approximate the drop in λ_A is the so-called
eigenvector centrality adapted to directed graphs (a combination of the so-called 'hub'-ness
and 'authority'-ness scores [Kle98] of each node). We just give the next antidote
to the hospital which currently has the highest such centrality score. This would be
faster than EXHAUSTIVE, though with some approximation. Note that we still have to
perform the eigenvalue computation (to update the centralities of all the nodes) after
each allocation decision. Can we do better?

Batched allocation. The answer is yes: in fact, we can make r times fewer updates
(for a suitably chosen r) to node centralities by carefully allocating r antidotes in one go,
using only the old centrality values. Thus we need to update and 'resync' the centralities
only every r allocations. We call this algorithm SMART-ALLOC: it is much faster (linear
in the number of nodes and edges) than the other methods, with minimal loss of accuracy
(a point we also illustrate using experiments; see Sections 7.6.2 and
7.6.3).


7.5  Proposed  Method — Theorems  and  proofs 

Here  we  give  details  of  the  two  main  ideas  we  mentioned  above.  Jumping  ahead,  our 
effective  and  efficient  algorithm  SMART-ALLOC  is  given  in  §  7.5.2. 


7.5.1  Best  single  allocation — Details 

Let u = [u_1, u_2, ..., u_N]^T and v = [v_1, v_2, ..., v_N]^T be the right and left eigenvectors of A
corresponding to λ_A. In a nutshell, the best node i* to give a single antidote to is the one
with the maximum value of u_i v_i, i.e. i* = arg max_i u_i v_i. We can prove the following
lemma to justify it.


Lemma 7.2. Assuming the current adjacency matrix is A, the change in the largest
eigenvalue Δλ_A after distributing one antidote to a node, say i, approximated to the first order (and with f(0) = 1) is
given by:

$$ \Delta\lambda_A = \lambda_A \, \frac{(f(1) - 1)\, u_i v_i}{v^T u} $$
Proof. We know that Au = λ_A u and v^T A = λ_A v^T (right and left eigenvectors).
According to the Perron-Frobenius theorem [McC00], λ_A is real and non-negative and the
components of the corresponding eigenvectors v and u are all positive. After a small
modification due to the medicine:

$$ (A + \Delta A)(u + \Delta u) = (\lambda_A + \Delta\lambda_A)(u + \Delta u) $$

Premultiplying by v^T and neglecting second order terms:

$$ \Delta\lambda_A = \frac{v^T \Delta A \, u}{v^T u} \qquad (7.2) $$

Clearly, after distributing one antidote to node i, ΔA is:

$$ \Delta A = A F_i - A \qquad (7.3) $$

where F_i = diag([f(0), ..., f(1), ..., f(0)]) (i.e. the i-th position on the diagonal is f(1)).
Using it in Equation 7.2:

$$ \Delta\lambda_A = \frac{v^T A F_i u}{v^T u} - \frac{v^T A u}{v^T u} = \lambda_A \frac{v^T F_i u}{v^T u} - \lambda_A = \lambda_A \, \frac{(f(1) - 1)\, u_i v_i}{v^T u} \qquad (7.4) $$

The last step uses f(0) = 1 (a node without antidotes is unchanged), so that
v^T F_i u = v^T u − (1 − f(1)) u_i v_i. Proved. □


This requires the computation of u and v, which is O(E). We can continue giving the
antidotes in this way, but as discussed above, we will need to re-compute u and v after
each allocation decision.
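To see how the first-order score behaves, the sketch below compares the estimate λ_A (f(1) − 1) u_i v_i / (v^T u), which follows from Equation 7.2 with ΔA = A F_i − A and f(0) = 1, against the exact eigenvalue change on a toy graph. It is a hedged illustration: the power-iteration solver, the helper names (`perron`, `transpose`, `first_order_drops`) and the toy graph are assumptions made here, not part of the original method.

```python
def perron(A, iters=500):
    """Leading eigenvalue and sum-normalised right eigenvector of a non-negative,
    irreducible matrix, via power iteration on A + I (an assumed stand-in solver)."""
    n = len(A)
    x = [1.0 / n] * n
    lam = 0.0
    for _ in range(iters):
        y = [x[i] + sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        s = sum(y)                      # growth factor of A + I, because sum(x) == 1
        lam, x = s - 1.0, [val / s for val in y]
    return lam, x

def transpose(A):
    return [list(col) for col in zip(*A)]

def first_order_drops(A, f):
    """First-order change in the leading eigenvalue for one antidote at each node."""
    lam, u = perron(A)                  # right eigenvector
    _, v = perron(transpose(A))         # left eigenvector
    vtu = sum(ui * vi for ui, vi in zip(u, v))
    return [lam * (f(1) - 1.0) * ui * vi / vtu for ui, vi in zip(u, v)]

def f(x):
    return 0.5 ** x

# Toy graph: hub 0 linked to nodes 1..4, plus an extra edge between 1 and 2.
A = [[0, 1, 1, 1, 1], [1, 0, 1, 0, 0], [1, 1, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0]]
drops = first_order_drops(A, f)
best = min(range(len(A)), key=lambda i: drops[i])       # most negative estimated change
halved = [[A[i][j] * (f(1) if j == best else 1.0) for j in range(len(A))]
          for i in range(len(A))]
exact = perron(halved)[0] - perron(A)[0]
print(best, drops[best], exact)
```

The estimate and the exact change agree in sign and rank the hub first on this graph; the gap between them is the second-order term that the lemma neglects.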


7.5.2  Batched  allocation — Details 

In sum, Smart-Alloc uses Algorithm 2 to batch-allocate and resync until we have
exhausted the total budget k (see § 7.5.2). We now show how we can batch-allocate r
antidotes in one go. Suppose the distribution of allocations, as before, is given by the
vector m. In this case, we can prove the following lemma, similar to Lemma 7.2.

Lemma 7.3. The change in the largest eigenvalue Δλ_A after distributing r antidotes according
to m (so Σ_i m_i = r), approximated to the first order, is given by:

$$ \Delta\lambda_A = \lambda_A \left( \frac{v^T F u}{v^T u} - 1 \right) $$

where v and u are the left and right eigenvectors of A corresponding to λ_A and F = diag(f(m)).




Proof. (Details omitted for brevity.) The main change from Lemma 7.2 is that ΔA = AF − A
now. □

Subsequently,  for  the  best  allocation  of  r  antidotes,  it  is  easy  to  see  that  we  have  the 
following  optimization  problem  now,  analogous  to  MIN-CONN: 

Problem 7.4 (MAX-DROP). Distribute antidotes such that:

$$ \text{minimize} \; \sum_{i=1}^{N} f(m_i)\, u_i v_i \quad \text{s.t.} \quad \sum_{i} m_i = r $$

(of course, ∀i m_i ∈ {0, 1, ...}). Clearly, it is an integer optimization problem, which in
general is NP-complete.

GreedyDrop:  An  optimal  greedy  algorithm 

Surprisingly,  we  can  prove  that  a  greedy  algorithm  achieves  the  optimal  solution  for 
MAX-DROP,  when  f(x)  is  monotone  non-increasing  convex.  The  algorithm  is  given  in 
Algorithm  2. 

Algorithm 2 GreedyDrop

Input: Directed weighted adjacency matrix A, batch-size r, monotone non-increasing
convex function f(x)

u = first right eigenvector of A
v = first left eigenvector of A
m = 0
for i = 1 to r do
    j = arg max_h [f(m_h) − f(m_h + 1)] u_h v_h
    m_j = m_j + 1
end for
return m
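A direct Python transcription of the greedy loop may help. This is a sketch under stated assumptions: it takes the products u_i v_i as a precomputed list `uv` (they would come from any eigen-solver), and `greedy_drop` and `objective` are hypothetical names introduced here.

```python
def greedy_drop(uv, r, f):
    """Allocate r antidotes one at a time to the node with the largest marginal drop.

    uv[h] is the product u_h * v_h of the leading right/left eigenvector entries.
    """
    m = [0] * len(uv)
    for _ in range(r):
        j = max(range(len(uv)), key=lambda h: (f(m[h]) - f(m[h] + 1)) * uv[h])
        m[j] += 1
    return m

def objective(uv, m, f):
    """MAX-DROP objective: sum_i f(m_i) * u_i * v_i (lower is better)."""
    return sum(f(mi) * s for mi, s in zip(m, uv))

def f(x):
    return 0.5 ** x

uv = [0.5, 0.3, 0.2]                    # illustrative scores, not real data
allocation = greedy_drop(uv, 3, f)
print(allocation, objective(uv, allocation, f))
```

With these illustrative scores the first antidote goes to node 0, the second to node 1 (node 0's marginal gain has halved), and the third back to node 0. Exhaustive enumeration of all allocations with three antidotes confirms that no other allocation has a lower objective.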


Intuitively, at each iteration, we pick the index (node) j which maximizes the drop
in the value of the objective at that step. Clearly, the running time of the algorithm is
O(E + rN): O(E) for the two eigenvectors and O(N) for each of the r allocations. We prove the optimality of GreedyDrop in Theorem 7.2. First, we prove the
following lemma:

Lemma 7.4. Given a convex non-increasing function f(x), define function g(x) = f(x) − f(x + 1).
Then g(x) is non-increasing.

Proof. As f(x) is monotone non-increasing and convex, from the property of convex
functions:

$$ f(x) - f(y) \ge f'(y)\,[x - y] \quad \forall x, y \qquad (7.5) $$

Using Equation 7.5 with (x, y) = (x, x + 1) and with (x, y) = (x + 1, x), we get:

$$ -f'(x + 1) \le g(x) \le -f'(x) $$

Similarly,

$$ -f'(x + 2) \le g(x + 1) \le -f'(x + 1) $$

Clearly, from the preceding inequalities, we have g(x + 1) ≤ −f'(x + 1) ≤ g(x) for all x, i.e. g(x) is a
non-increasing function. □
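As a quick illustration of Lemma 7.4 (a worked example only, not part of the original proof), take the exponential decay f(x) = 0.5^x used in the experiments of Section 7.6.1:

$$ g(x) = 0.5^{x} - 0.5^{x+1} = 0.5^{x+1}, \qquad g(0) = 0.5,\; g(1) = 0.25,\; g(2) = 0.125,\; \dots $$

Each additional antidote at the same node therefore yields half the marginal gain of the previous one, which is the diminishing-returns property that GreedyDrop relies on.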

Theorem 7.2. GreedyDrop returns the optimal integral m for MAX-DROP when f(x) is
monotone non-increasing and convex.

Proof. Say GreedyDrop returns m^G as the answer, but m* is the true optimal. Then there
was some first step (say t) where we incremented some m_j from s_j to s_j + 1 in m^G but
m* has m_j = s_j. Because we have a fixed batch-budget r, m* also has some m_k as s_k + 1
while m^G has m_k which is at most s_k.

Consider another assignment m' which is identical to m* except m_k = s_k and m_j =
s_j + 1. Note that we are still satisfying our constraint and hence it is a valid assignment.
The score of this assignment is:

$$ \mathrm{Score}(m') = \sum_{i=1}^{N} f(m'_i)\, u_i v_i = \mathrm{Score}(m^*) + [f(s_k) - f(s_k + 1)]\, u_k v_k - [f(s_j) - f(s_j + 1)]\, u_j v_j \qquad (7.6) $$

where the last step is due to the construction of m'.

Recall that while computing m^G, GreedyDrop had chosen j at step t, i.e.,

$$ j = \arg\max_h \; [f(m_h) - f(m_h + 1)]\, u_h v_h $$

at step t. At that instant, suppose m_k = s'_k. Hence from the above equation we can
conclude that:

$$ [f(s'_k) - f(s'_k + 1)]\, u_k v_k \le [f(s_j) - f(s_j + 1)]\, u_j v_j \qquad (7.7) $$

In addition, we know that s'_k ≤ s_k. But from Lemma 7.4, g(s'_k) ≥ g(s_k), i.e.

$$ f(s'_k) - f(s'_k + 1) \ge f(s_k) - f(s_k + 1) \qquad (7.8) $$

So, from Equations 7.7 and 7.8 (and since u_k v_k is positive):

$$ [f(s_k) - f(s_k + 1)]\, u_k v_k \le [f(s_j) - f(s_j + 1)]\, u_j v_j $$

Coupled with Equation 7.6, the above inequality implies that Score(m') ≤ Score(m*). If
Score(m') < Score(m*), then m* is not optimal, contradicting our assumption that
m* is optimal and hence has the lowest score. If Score(m') = Score(m*), then we can
conclude that GreedyDrop did not make an error at step t and made it at some other point.
Continuing similarly, finally, either m* is not optimal or GreedyDrop is correct. Hence, by
contradiction, m^G is optimal and GreedyDrop gives the optimal integral answer. □
Smart-Alloc


Finally, we are ready to describe our algorithm Smart-Alloc: use GreedyDrop (Algorithm 2)
to batch-allocate a small number (r) of resources, then 're-sync' (re-compute)
the first left and right eigenvectors, and continue similarly till our budget k is exhausted.

One may ask: why can we not directly allocate all k antidotes in one go using GreedyDrop?
This is because, unfortunately, the first-order approximation in
Lemma 7.3 is accurate only when the number of antidotes k is small w.r.t. the graph, i.e.
when k ≪ N. But that is not the case in general: e.g., in our motivating problem one
may want to distribute 200 infection control resources among 2000 nationwide hospitals.
In fact, k can be arbitrarily high, since the units of resource in this problem are set with
arbitrary granularity. It is easy to see the next lemma:

Lemma 7.5 (Running time of Smart-Alloc). The running time of the algorithm SMART-ALLOC
is O(kE/r + kN).

Clearly, we want to use as large an r as possible. Our proposed rule-of-thumb is to
choose r proportional to the spectral gap (|λ_A| − |λ_{A,2}|) of the graph. The larger the spectral
gap, the lower the sensitivity of the spectrum of A [GVL89], the less often we need to re-sync,
and hence the larger the r we can use. For example, in our experiments on hospital graphs,
which had a small spectral gap, we found that r = 6 performed very well.
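Putting the pieces together, the batch-and-resync loop can be sketched as below. This is an illustration under explicit assumptions, not the thesis code: it uses a power iteration on A + I instead of Lanczos, and it restarts the batch counts after every resync, which matches allocating on A·diag(f(m)) only when f(a + b) = f(a) f(b), as holds for f(x) = 0.5^x. The names `perron`, `transpose`, `greedy_drop` and `smart_alloc` are hypothetical.

```python
def perron(A, iters=500):
    """Leading eigenvalue and sum-normalised right eigenvector (power iteration on A + I)."""
    n = len(A)
    x = [1.0 / n] * n
    lam = 0.0
    for _ in range(iters):
        y = [x[i] + sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        s = sum(y)
        lam, x = s - 1.0, [val / s for val in y]
    return lam, x

def transpose(A):
    return [list(col) for col in zip(*A)]

def greedy_drop(uv, r, f):
    """Greedy batch allocation on fixed scores uv[h] = u_h * v_h."""
    m = [0] * len(uv)
    for _ in range(r):
        j = max(range(len(uv)), key=lambda h: (f(m[h]) - f(m[h] + 1)) * uv[h])
        m[j] += 1
    return m

def smart_alloc(A, k, r, f):
    """Allocate k antidotes in batches of r, recomputing the eigenvectors between batches."""
    n = len(A)
    total = [0] * n
    current = [row[:] for row in A]     # immunized matrix so far
    remaining = k
    while remaining > 0:
        batch = min(r, remaining)
        _, u = perron(current)                  # resync: right eigenvector
        _, v = perron(transpose(current))       # resync: left eigenvector
        m = greedy_drop([ui * vi for ui, vi in zip(u, v)], batch, f)
        # Assumes f(a + b) = f(a) * f(b), so scaling the current matrix is exact.
        current = [[current[i][j] * f(m[j]) for j in range(n)] for i in range(n)]
        total = [t + b for t, b in zip(total, m)]
        remaining -= batch
    return total

def f(x):
    return 0.5 ** x

# Toy graph: hub 0 linked to nodes 1..4, plus an extra edge between 1 and 2.
A = [[0, 1, 1, 1, 1], [1, 0, 1, 0, 0], [1, 1, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0]]
print(smart_alloc(A, 6, 2, f))
```

Only k/r eigenvector computations are performed here, compared with k·N eigenvalue computations in EXHAUSTIVE, which is the source of the kE/r term in Lemma 7.5.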


7.6  Experiments 

We designed experiments to answer the following questions about our algorithm SMART-ALLOC:
(i) Effectiveness for reducing the rate of infection, (ii) Effectiveness for MIN-CONN
and (iii) Scalability. In short, Smart-Alloc proves to be a fast and effective
algorithm for both reducing the rate of infection and solving MIN-CONN and is very
close to EXHAUSTIVE, at a fraction of the running cost, while others are much worse.


7.6.1  Setup 

To answer the above questions we ran extensive simulation experiments and compared
against many other resource allocation methods (see Table 7.1) on multiple real-world
datasets (see Table 7.2). We ran parallel experiments on a Condor [TTL05] cluster
of 58 cores, each being a generic Fedora 7 machine. All the algorithms and the SI infection
process were coded in C++. We use f(x) = 0.50^x and r = 6 for all our experiments. The
choice of the function f(x) captures the diminishing marginal utility of infection control
based on a wide range of studies in the medical literature of existing infection control
techniques (cf. [ZIM+09]).
Table 7.1: Various algorithms used for comparison

| Method Name | Method Description | Speed O(·) |
|---|---|---|
| Uniform | Distribute uniformly among the nodes, breaking ties randomly. | kN |
| Degree | Distribute randomly proportional to the 'degree'† of the nodes. | E + kN |
| Exhaustive | Allocate each additional antidote to that node which decreases the largest eigenvalue λ_A the most in that step. | kEN |
| Smart-Alloc | Allocate r antidotes in one go based on node centralities and only then recompute. | kE/r + kN |

† As the graphs are directed, we use degree centrality [Fre79]: the geometric mean of in-degree
(the number of transfers the hospital receives) and out-degree (the number of
transfers the hospital sends out).
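The two simple baselines of Table 7.1 can be sketched as follows. The details are assumptions made for illustration: UNIFORM is read as "equal share, with the leftover antidotes going to randomly chosen nodes", DEGREE as "sample nodes with probability proportional to the geometric mean of weighted in- and out-degree", and `A[i][j]` as the number of transfers from i to j. The function names are hypothetical.

```python
import math
import random

def degree_centrality(A):
    """Geometric mean of in-degree and out-degree (weighted by number of transfers)."""
    n = len(A)
    out_deg = [sum(A[i]) for i in range(n)]
    in_deg = [sum(A[i][j] for i in range(n)) for j in range(n)]
    return [math.sqrt(i * o) for i, o in zip(in_deg, out_deg)]

def degree_alloc(A, k, rng):
    """Assumed reading of DEGREE: draw k nodes with probability proportional to centrality."""
    m = [0] * len(A)
    for j in rng.choices(range(len(A)), weights=degree_centrality(A), k=k):
        m[j] += 1
    return m

def uniform_alloc(n, k, rng):
    """Assumed reading of UNIFORM: equal share, leftovers to randomly chosen distinct nodes."""
    base, extra = divmod(k, n)
    m = [base] * n
    for j in rng.sample(range(n), extra):
        m[j] += 1
    return m

# Illustrative 3-node transfer matrix (made-up weights).
A = [[0, 2, 0], [0, 0, 8], [1, 0, 0]]
rng = random.Random(0)
print(degree_centrality(A), degree_alloc(A, 10, rng), uniform_alloc(3, 10, rng))
```

Neither baseline looks at the eigenvectors, which is why both are cheap (no eigen-computation at all) but blind to 'bridge' nodes.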


Table 7.2: Various real-world datasets used in our work

| Dataset Name | Nodes (N) | Edges (E) | Description |
|---|---|---|---|
| US-MEDICARE | 2138 | 102411 | All critical patient transfers among US hospitals based on all Medicare Provider Analysis and Review (MedPAR) final action claims between Sept. 1, 2004 - Sept. 1, 2005 [ICM+09]. |
| PENN-ALL | 137 | 11211 | Critical patient transfers within Pennsylvania hospitals based on all discharges (not just Medicare) between July 1, 2004 - June 30, 2006 [ICM+09]. |
| GESTURE | 166,774 | 1.5 million | Second-Life transfer-network of 'gestures' among users. Gestures can include anything from animation, chat to playing sounds. |

‡ Weight for each edge u → v was the average number of transfers from hospital u to v
per day.
7.6.2 Effectiveness for MIN-CONN problem


[Figure 7.2 plots: (a) US-MEDICARE, largest eigenvalue vs. resources; (b) PENN-ALL, largest eigenvalue vs. resources]

Figure 7.2: Largest eigenvalue after allocation vs. budget k of resources used for different
algorithms. (a) US-MEDICARE network (b) PENN-ALL network. Lower is better and SMART-ALLOC
is near-optimal in both cases. (Plots use color.)

MIN-CONN  aims  to  decrease  the  largest  eigenvalue  the  most  -  how  do  the  algorithms 
perform  in  that  measure?  In  short,  SMART-ALLOC  comes  very  close  to  EXHAUSTIVE  while 
others  are  much  worse.  Figure  7.2  shows  the  largest  eigenvalue  of  the  resulting  graph 
after  giving  k  antidotes  according  to  various  algorithms  vs  k  on  the  US-MEDICARE 
and  PENN-ALL  networks.  UNIFORM  and  DEGREE  perform  poorly,  although  DEGREE 
is  better  (sometimes  marginally)  than  UNIFORM.  SMART-ALLOC  and  EXHAUSTIVE  are 
much  better  at  achieving  the  lowest  eigenvalue  for  all  k.  EXHAUSTIVE  is  expected  to  be 
near-optimal  as  it  does  an  exhaustive  search  via  repeated  eigenvalue  computation  for 
the  node  which  decreases  the  eigenvalue  the  most.  On  the  other  hand,  SMART-ALLOC 
performs  well  due  to  our  careful  approximation  and  algorithm-design. 


7.6.3  Effectiveness  for  MAX-HEALTH  problem 

We ultimately want to test how the algorithms perform for MAX-HEALTH. In short,
again, Smart-Alloc proves to be an effective algorithm and is very close to EXHAUSTIVE
while others are much worse. See Figures 7.3 and 7.4: they show the expected
number of infected nodes (hospitals) vs. time tick after running the infection process
on the partially immunized US-MEDICARE and PENN-ALL networks for different
budgets k of antidotes. The different curves represent the different algorithms used for
allocation. As the edge-weights represent the average number of transfers per day, the
curves represent the average footprint for each day after the infection starts. Each curve
is an average of 21380 and 1370 simulation runs for US-MEDICARE and PENN-ALL
respectively; in this way we ensured that we seeded the infection from each hospital for
10 different runs. We ran the simulations till 365 time-ticks (= 1 year) and took the average
over all runs for each time-tick. Finally, the range of values of k for US-MEDICARE and
PENN-ALL were chosen according to the network sizes and the function f(x) = 0.50^x.
[Figure 7.3 plots: US-MEDICARE, infected vs. time, for (a) k = 100, (b) k = 150, (c) k = 200 resources]

Figure 7.3: US-MEDICARE network for different algorithms and budget k of resources: expected
number of infections vs. time ticks (≈ days). Again EXHAUSTIVE and SMART-ALLOC perform
the best and are close to each other, as expected. Each curve is an average of 21380 runs and lower is
better. (Plot uses color.)


[Figure 7.4 plots: PENN-ALL, infected vs. time, for (a) k = 75, (b) k = 100, (c) k = 120 resources]

Figure 7.4: PENN-ALL network for different algorithms and budget k of resources: expected
number of infections vs. time ticks (≈ days). Again EXHAUSTIVE and SMART-ALLOC perform
the best (they are almost on top of each other), as expected. Each curve is an average of 1370 runs and
lower is better. (Plot uses color.)


Our algorithm SMART-ALLOC is clearly very close to EXHAUSTIVE and has the lowest
footprint every day compared to the rest. For example, in Figure 7.3(c), after a year with
k = 200 antidotes, EXHAUSTIVE and Smart-Alloc have an expected total of 42 and 46
hospitals infected, while the other methods are about 2.5 times worse, at around
110. It is even more pronounced in PENN-ALL (Figure 7.4(c)): after a year with k = 120,
Exhaustive and Smart-Alloc have an expected footprint of ≈ 8, while the next closest
method is about 3 times worse at around 23. This shows the dramatic impact an effective
allocation algorithm can have on the number of infected nodes. Moreover, note that
all algorithms essentially mimic their performance w.r.t. MIN-CONN (Figure 7.2), i.e.
the larger the corresponding drop in the first eigenvalue λ_A, the lower the number of expected
infections, validating our reduction of MAX-HEALTH to MIN-CONN.

The current practice is for each hospital to independently manage infection control,
which may be no better from the network perspective than using UNIFORM. But note
that compared to UNIFORM, SMART-ALLOC can be up to 6 times better (see Figure 7.4(c)).
Interestingly, for the US-MEDICARE network, we found that to achieve the same level
of infection control as Smart-Alloc with k = 120, we need a budget of about k = 800
resources if distributed according to UNIFORM.

7.6.4  Scalability 

As discussed before, Smart-Alloc is much faster than its chief competitor EXHAUSTIVE
(see Table 7.1). For example, it took more than 10 hours to distribute 200 resources
using EXHAUSTIVE on the US-MEDICARE network while it took just ≈ 14 seconds to run
Smart-Alloc for the same budget: a 2500x speed-up. As a further comparison, the
naive simulation-based algorithm ran for a week and still had not finished for the same
budget: a more than 30,000x speed-up. Additionally, on the GESTURE network, we had
to stop Exhaustive after it took 3 days to allocate a single resource; Smart-Alloc took
≈ 150 mins to allocate 2000 resources.
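The quoted speed-ups can be checked with a line of arithmetic, using only the running times stated above (10 hours, one week, and roughly 14 seconds); the computation below is an illustration, not an additional measurement:

$$ \frac{10 \times 3600\ \text{s}}{14\ \text{s}} \approx 2571, \qquad \frac{7 \times 24 \times 3600\ \text{s}}{14\ \text{s}} = 43200 $$

Both ratios are lower bounds, since EXHAUSTIVE took more than 10 hours and the simulation-based run was still unfinished after a week; they are consistent with the stated 2500x and "more than 30,000x" figures.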

7.6.5  Generality 


[Figure 7.5 plot: x-axis is time ticks (days)]

Figure 7.5: Expected number of infections vs. time-ticks for different algorithms, budget k = 2000
on the GESTURE network. Smart-Alloc is the best. Each curve is an average of 1000 runs. (Plot
uses color.)

As mentioned in § 7.1, although our problem was originally motivated on hospital-transfer
networks, the problem of fractional immunization arises in many other scenarios,
and is arguably more realistic than complete immunization. To demonstrate the utility
of Smart-Alloc in domains other than epidemiology, we also compare performance
on the GESTURE network of asset transfers between virtual world users. Figure 7.5
shows the expected number of infected users vs. time, if a malicious asset were to
be created and transferred between users. We budget k = 2000 antidotes, and select the
infection source randomly. We don't show EXHAUSTIVE because, as mentioned before, it
didn't complete even after 3 days whereas Smart-Alloc allocated all 2000 resources in
≈ 150 mins. As expected, SMART-ALLOC has the fewest users infected, while others have
up to ≈ 2.5 times more users infected, demonstrating the efficacy of our algorithm in a
completely different domain.


7.7  Conclusion 

This  work  is  the  first  to  address  the  problem  of  allocation  of  infection-control  resources 
with  fractional  and  asymmetric  impact  among  nodes  in  a  network.  It  is  a  more  general 
problem  than  that  of  selecting  a  subset  of  nodes  to  be  immunized  completely  via  a 
vaccine.  The  potential  applications  are  broad — from  curbing  spread  of  infection  between 
hospitals  from  patient  transfers,  to  preventing  spread  of  malicious  code  in  virtual  world 
settings. 

We formulated the problem, proved it is NP-complete, and gave a highly efficient
and effective algorithm, Smart-Alloc, whose performance we demonstrated through extensive
experiments on multiple real-world datasets, including nation-wide patient-transfer
networks and electronic virtual-world social transfer-networks. SMART-ALLOC runs in
seconds (as opposed to weeks) on commodity hardware; more importantly, applied on
real hospital-transfer networks (2005 U.S. Medicare data, 2004-2006 PA all-payer data) it
results in up to 6x fewer infections, compared to current practice and other heuristics.

The  current  practice  in  control  of  highly  resistant  organisms  via  patient  transfers 
has  been  largely  focused  within  individual  hospitals.  Hence,  the  current  public  health 
policy  is  missing  an  opportunity  to  significantly  reduce  infection  rates  with  an  infection 
prevention  strategy  that  accounts  for  the  potential  transfer  of  bacteria  along  the  network 
of  inter-hospital  patient  transfers. 
Chapter 8

General  Edge  Placement 


In  this  chapter,  we  shift  the  problem  of  controlling  the  dissemination  of  an  entity  (e.g., 
meme,  virus,  etc)  on  a  large  graph  to  the  level  of  edges  and  ask:  which  edges  should  we 
add  or  delete  in  order  to  speed-up  or  contain  a  dissemination?  First,  we  propose  effective 
and  scalable  algorithms  to  solve  these  dissemination  problems.  Second,  we  conduct  a 
theoretical  study  of  the  two  problems  and  our  methods,  including  the  hardness  of  the 
problem,  the  accuracy  and  complexity  of  our  methods,  and  the  equivalence  between  the 
different  strategies  and  problems.  Lastly,  we  conduct  experiments  on  real  topologies  of 
varying  sizes  to  demonstrate  the  effectiveness  and  scalability  of  our  approaches. 


8.1  Introduction 

As  we  have  already  seen  in  previous  chapters,  managing  the  dissemination  of  an  entity 
(e.g.,  meme,  virus,  etc)  on  a  large  graph  is  a  challenging  problem  with  applications  in 
various  settings  and  disciplines.  In  its  generality,  the  propagating  entity  can  be  many 
different  things,  such  as  a  meme,  a  virus,  an  idea,  a  new  product,  etc.  The  propagation 
is  affected  by  the  topology  and  the  properties  of  the  entity:  its  'virality',  its  speed,  its 
'stickiness'  or  the  duration  of  the  infection  of  a  node.  Our  focus  here  is  the  topology,  since 
we  assume  that  we  cannot  alter  the  properties  of  the  propagating  entity. 

The  problem  we  address  is  how  we  can  affect  the  propagation  by  modifying  the 
edges  of  the  graph.  In  fact,  we  address  two  different  problems.  First,  in  the  NetMelt 
problem,  we  want  to  contain  the  dissemination  by  removing  a  given  number  of  edges. 
For  example,  we  can  consider  the  distribution  of  malware  over  a  social  network.  Deleting 
user  accounts  may  not  be  desirable,  but  deleting  edges  ('unfriending'  people)  may  be 
more  acceptable.  More  specifically,  we  want  to  delete  a  set  of  k  edges  from  the  graph 
to  minimize  the  infected  population.  Second,  in  the  NetGel  problem,  we  want  to 
enable  the  dissemination  by  adding  a  given  number  of  edges.  Specifically,  we  want 
to  add  a  set  of  k  new  edges  into  the  graph  to  maximize  the  population  that  adopt  the 
information. For example, we could extend the social network scenario using the recent
'Arab Spring', in which Facebook and Twitter were often used for coordinating events: we may
want to maximize the spread of a piece of information. Note that an additional,
key requirement for both problems is computational efficiency: the solution should scale
to large graphs.

Both problems are challenging for slightly different reasons. For the NETMELT
problem, most of the existing methods operate on the node-level, e.g., deleting a subset
of the nodes from the graph to minimize the infected population from a propagating
virus. In the above social spam example, this means that we might have to shut down
some legitimate user accounts. Can we avoid this by operating on a finer granularity,
that is, only deleting a few edges between users to slow down the social spam spreading?
For the NetGel problem, things are even more challenging because of its high intrinsic
time complexity. Let n be the number of the nodes in the graph. There are almost n²
non-existing edges since many real graphs are very sparse. In other words, even if
we only want to add one single new edge into the graph, the solution space is O(n²).
This complexity 'explodes' if we aim to add multiple new edges collectively, where the
solution space becomes exponential. To date, there does not exist any scalable solution for
the NetGel problem.

The  overarching  contribution  of  this  work  is  the  formulation  and  theoretical  study  of 
the  dissemination  management  via  edge  manipulation:  how  to  place  a  set  of  edges1  to 
achieve  the  desired  outcome.  The  main  contributions  can  be  summarized  as  follows: 

•  Algorithms.  We  propose  effective  and  scalable  algorithms  to  optimize  the  leading 
eigenvalue,  the  key  graph  parameter  that  controls  the  information  dissemination 
processes  for  both  NetMelt  and  NetGel,  respectively; 

•  Proofs  and  Analysis.  We  show  the  accuracy  and  the  complexity  of  our  methods;  the 
hardness  of  the  problem,  and  equivalence  between  the  different  strategies; 

•  Experimental  Evaluations.  Our  evaluations  on  real  large  graphs  show  that  our 
methods  are  both  effective  and  scalable  (see  Fig.  8.1  as  an  example). 

The  rest  of  the  work  is  organized  as  follows.  We  introduce  notation  and  formally 
define  the  NetGel  and  NetMelt  problems  in  Section  2.  We  present  and  analyze  the 
proposed  algorithms  in  Section  3  and  Section  4,  respectively.  We  provide  experimental 
evaluations  in  Section  5.  We  review  the  related  work  in  Section  6  and  conclude  in 
Section  7. 


8.2  Problem  Definitions 

Table 8.1 lists the main symbols used throughout the chapter. We consider directed,
irreducible unipartite graphs. For ease of presentation, we discuss the unweighted
graph scenario although the algorithms we propose can be naturally generalized to the
weighted case. We represent a graph by its adjacency matrix. Following the standard
notation, we use bold upper-case for matrices (e.g., A), bold lower-case for vectors (e.g.,
a), and calligraphic fonts for sets (e.g., J). We denote the transpose with a prime (i.e., A'
is the transpose of A). Also, we represent the elements in a matrix using a convention
similar to Matlab, e.g., A(i, j) is the element at the i-th row and j-th column of the matrix A,
and A(:, j) is the j-th column of A, etc.

1 In this work, we use the terms 'link' and 'edge' interchangeably.

Figure 8.1: Comparison of maximizing the outcome of the information dissemination process.
Larger is better. The proposed method (red) leads to the largest number of 'infected' nodes
(e.g., having more people in the social networks adopt a good idea, etc.). Notice
that all the alternative methods are mixed with the result on the original graph (yellow), which
means that they fail to affect the outcome of the dissemination process. See Section 6 for detailed
experimental setting.

When  we  discuss  the  relationship  between  the  two  different  strategies  (node  deletion 
vs.  edge  deletion)  for  the  NetMelt  problem,  it  is  helpful  to  introduce  the  concept  of  line 
graph,  where  the  nodes  represent  the  edges  in  the  original  graph.  Formally,  each  edge  in 
the  original  graph  A  becomes  a  node  in  the  line  graph  L(A);  and  there  is  an  edge  from 
one  node  to  the  other  in  the  line  graph  if  the  target  of  the  former  edge  is  the  same  as  the 
source  of  the  latter  edge  in  the  original  graph  A.  It  is  formally  defined  as  follows: 

Definition 8.1 (Line Graph). Given a directed graph A, its directed line graph L(A) is a graph
such that each node of L(A) represents an edge of A, and there is an edge from a node e_1 to a node e_2 in
L(A) iff for the corresponding edges (i_1, j_1) and (i_2, j_2) in A, j_1 = i_2.

With the notation of the line graph L(A), we have two equivalent ways to represent
an edge. Let e_x (e_x = 1, ..., m) be the index of the nodes (i.e., the edges in A) in the line
graph. We can also represent the edge e_x by the pair of its source and target nodes in the
original graph A: (i_x, j_x), i.e., the edge e_x starts at node i_x and ends at node j_x.
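
As an illustration of Definition 8.1, the following minimal Python sketch builds the directed line graph from an edge list. The function name `line_graph`, the 0-based node ids, and the convention that edge e_x is identified by its position in the list are assumptions made for this example; they are not part of the original text.

```python
from collections import defaultdict


def line_graph(edges):
    """Return the edges of the directed line graph L(A).

    `edges` is a list of (source, target) pairs; edge e_x is identified by
    its position x in that list (an assumption of this sketch). The result
    contains (x, y) whenever the target of e_x equals the source of e_y,
    i.e. j_x == i_y.
    """
    # Group edge indices by source node so each lookup is constant time.
    by_source = defaultdict(list)
    for y, (i_y, _) in enumerate(edges):
        by_source[i_y].append(y)

    line_edges = []
    for x, (_, j_x) in enumerate(edges):
        for y in by_source.get(j_x, []):
            line_edges.append((x, y))
    return line_edges


# A directed 3-cycle: its line graph is again a directed 3-cycle.
cycle = [(0, 1), (1, 2), (2, 0)]
print(line_graph(cycle))
```

The number of edges in L(A) equals the sum, over all nodes, of in-degree times out-degree, so the line graph can be considerably larger than the original graph.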

In order to design an effective strategy to optimize the graph structure to affect the
outcome of an information dissemination process, we need to answer the following three
questions. (1) (Key graph parameters/metrics) What are the key graph metrics/parameters
that determine/control the dissemination process? (2) (Graph operations) What types
of graph operations (e.g., deleting nodes/edges, adding edges, etc.) are we allowed to use to
change the graph structure? (3) (Affecting algorithms) For a given graph operation, how
can we design effective, scalable algorithms to optimize the corresponding key graph
parameters?

Table 8.1: Symbols

Symbol        Definition and Description
A, B, ...     matrices (bold upper case)
A(i, j)       the element at the ith row and the jth column of A
A(i, :)       the ith row of matrix A
A(:, j)       the jth column of matrix A
A'            transpose of matrix A
a, b, ...     vectors
J, ...        sets (calligraphic)
λ             the largest (in modulus) eigenvalue of A
u, v          the n × 1 left eigenvector and right eigenvector associated with λ
n             the number of the nodes in the graph
m             the number of the edges in the graph
k             the budget (i.e., the number of deleted or added edges)

For information dissemination on real graphs, we have already seen that for a large
family of dissemination processes, the largest (in modulus) eigenvalue λ of the adjacency
matrix A, or of an appropriately defined system matrix, is the only graph parameter that
determines the tipping point of the dissemination process, i.e., whether or not the
dissemination will become an epidemic. In principle, this gives clear guidance on the
algorithmic side: an ideal, optimal strategy to affect the outcome of the information
dissemination process should change the graph structure so that the leading eigenvalue λ is
minimized or maximized.

Based on this observation, we can now transform the original problem of affecting
the dissemination process into an eigenvalue optimization problem, that is,

(1) minimize the leading eigenvalue λ for NetMelt;

(2) maximize the leading eigenvalue λ for NetGel.

In this chapter, we focus on operating at the edge level to design affecting algorithms.
With the above notation, our problems can be formally defined as the following two
sub-problems:

Problem  8.1.  NetMelt  (Edge  Deletion) 

Given: A large n × n graph A and an integer (budget) k;

Output: A set of k edges from A whose deletion from A creates the largest decrease of the leading
eigenvalue of A.

Problem  8.2.  NetGel  (Edge  Addition) 

Given:  A  large  n  x  n  graph  A  and  an  integer  (budget)  k; 

Find:  A  set  of  k  non-edges  of  A  whose  addition  to  A  creates  the  largest  increase  of  the  leading 
eigenvalue  of  A. 

As  we  will  show  soon,  both  problems  are  combinatorial. 


8.3  Proposed  Algorithm  for  NetMelt 

In this section, we address the NetMelt problem (Prob. 8.1), that is, to delete k edges
from the original graph A so that its leading eigenvalue λ will decrease as much as
possible. We first study the relationship between two different strategies (edge deletion
vs. node deletion), and then present our algorithm, followed by the analysis of its
effectiveness as well as efficiency.

8.3.1  Edge  Deletion  vs.  Node  Deletion 

Roughly  speaking,  in  the  NETMELT  Problem  (Edge  Deletion),  we  want  to  find  a  set  of  k 
'important'  edges  from  the  graph  A  to  delete.  With  the  notation  of  the  line  graph  L(A), 
intuitively,  such  'important'  edges  in  A  might  become  'important'  nodes  in  the  line  graph 
L(A).  In  this  section,  we  briefly  present  the  relationship  between  these  two  strategies 
(node  deletion  vs.  edge  deletion). 

Our  main  result  is  summarized  in  Lemma  8.1,  which  says  that  the  eigenvalues  of  the 
original  graph  A  are  also  the  eigenvalues  of  its  line  graph  L(A). 

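Lemma 8.1. Line Graph Spectrum. Let $\lambda$ be an eigenvalue of the graph A. Then $\lambda$ is also an
eigenvalue of the line graph L(A).

PROOF. Let $\mathbf{u}$ and $\mathbf{v}$ be the left and right eigenvectors of the graph A that correspond
to any eigenvalue $\lambda$. We have $\mathbf{A}\mathbf{v} = \lambda\mathbf{v}$ and $\mathbf{u}'\mathbf{A} = \lambda\mathbf{u}'$. Equivalently, we have:

$$\lambda\,\mathbf{v}(i) = \sum_{j:\,\mathbf{A}(i,j)=1} \mathbf{v}(j) \quad\text{and}\quad \lambda\,\mathbf{u}(j) = \sum_{i:\,\mathbf{A}(i,j)=1} \mathbf{u}(i) \tag{8.1}$$

Let $e_x$ ($e_x = 1, \ldots, m$) be an edge of the graph A. Recall that we can also represent
$e_x$ by its source and target nodes: $(i_x, j_x)$. Let $\tilde{\mathbf{A}}$ be the adjacency matrix of the line
graph L(A). By the definition of the line graph, we have $\tilde{\mathbf{A}}(e_x, e_y) = 1$ if $j_x = i_y$, and
$\tilde{\mathbf{A}}(e_x, e_y) = 0$ otherwise ($e_x, e_y = 1, \ldots, m$).

We define two $m \times 1$ vectors $\tilde{\mathbf{u}}$ and $\tilde{\mathbf{v}}$ as: $\tilde{\mathbf{u}}(e_x) = \mathbf{u}(i_x)$ and $\tilde{\mathbf{v}}(e_x) = \mathbf{v}(j_x)$ with
$e_x = 1, \ldots, m$.

Next, we will show that $\tilde{\mathbf{u}}$ and $\tilde{\mathbf{v}}$ are the left and right eigenvectors of $\tilde{\mathbf{A}}$ respectively,
with $\lambda$ being the corresponding eigenvalue.

For any edge $e_x$ ($e_x = 1, \ldots, m$), we have

$$\begin{aligned}
\tilde{\mathbf{A}}(e_x,:)\,\tilde{\mathbf{v}} &= \sum_{e_y:\,\tilde{\mathbf{A}}(e_x,e_y)=1} \tilde{\mathbf{v}}(e_y) \\
&= \sum_{e_y:\,i_y=j_x} \mathbf{v}(j_y) && \text{(all edges adjacent from } e_x\text{)} \\
&= \sum_{j_y:\,\mathbf{A}(j_x,j_y)=1} \mathbf{v}(j_y) && \text{(all edges from node } j_x\text{)} \\
&= \lambda\,\mathbf{v}(j_x) && \text{(due to Eq. (8.1))} \\
&= \lambda\,\tilde{\mathbf{v}}(e_x) && \text{(due to dfn. of } \tilde{\mathbf{v}}\text{)}
\end{aligned} \tag{8.2}$$

Similarly, for any edge $e_x$ ($e_x = 1, \ldots, m$), we have

$$\begin{aligned}
\tilde{\mathbf{u}}'\,\tilde{\mathbf{A}}(:,e_x) &= \sum_{e_y:\,\tilde{\mathbf{A}}(e_y,e_x)=1} \tilde{\mathbf{u}}(e_y) \\
&= \sum_{e_y:\,j_y=i_x} \mathbf{u}(i_y) && \text{(all edges adjacent to } e_x\text{)} \\
&= \sum_{i_y:\,\mathbf{A}(i_y,i_x)=1} \mathbf{u}(i_y) && \text{(all edges to node } i_x\text{)} \\
&= \lambda\,\mathbf{u}(i_x) && \text{(due to Eq. (8.1))} \\
&= \lambda\,\tilde{\mathbf{u}}(e_x) && \text{(due to dfn. of } \tilde{\mathbf{u}}\text{)}
\end{aligned} \tag{8.3}$$

Putting Eq. (8.2) and Eq. (8.3) together, we have $\tilde{\mathbf{A}}\tilde{\mathbf{v}} = \lambda\tilde{\mathbf{v}}$ and $\tilde{\mathbf{u}}'\tilde{\mathbf{A}} = \lambda\tilde{\mathbf{u}}'$, which
completes the proof. □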
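
The lifting argument in the proof can be checked numerically on a small graph. The sketch below is only an illustration under stated assumptions: the helper names are hypothetical, the leading eigenpair is obtained by a plain power iteration on A + I (the identity shift is a choice made here so the iteration also behaves on periodic graphs), and the graph is assumed to be irreducible. It lifts the right eigenvector v to the edges via v~(e_x) = v(j_x) and measures how far the lifted vector is from satisfying the eigen-equation on the line graph.

```python
def leading_right_eigen(edges, n, iters=500):
    """Power iteration on A + I (hypothetical helper for this sketch).

    Adding the identity shifts every eigenvalue by one; the shift is
    removed again before the eigenvalue estimate is returned.
    """
    v = [1.0 / n] * n
    lam = 0.0
    for _ in range(iters):
        w = list(v)                      # the "+ I" contribution
        for i, j in edges:
            w[i] += v[j]                 # (A v)(i) sums v(j) over edges (i, j)
        lam = sum(w) / sum(v) - 1.0      # eigenvalue estimate for A itself
        total = sum(w)
        v = [x / total for x in w]
    return lam, v


def line_graph_residual(edges, n):
    """Largest violation of the line-graph eigen-equation for the lifted v."""
    lam, v = leading_right_eigen(edges, n)
    v_tilde = [v[j] for _, j in edges]   # v~(e_x) = v(j_x)
    by_source = {}
    for y, (i_y, _) in enumerate(edges):
        by_source.setdefault(i_y, []).append(y)
    worst = 0.0
    for x, (_, j_x) in enumerate(edges):
        # Row e_x of the line-graph adjacency selects the edges leaving j_x.
        lhs = sum(v_tilde[y] for y in by_source.get(j_x, []))
        worst = max(worst, abs(lhs - lam * v_tilde[x]))
    return lam, worst


edges = [(0, 1), (1, 0), (1, 2), (2, 0)]
lam, worst = line_graph_residual(edges, 3)
print(lam, worst)
```

For this 3-node example the characteristic polynomial of A is λ^3 - λ - 1, so the returned eigenvalue should be a root of that polynomial and the residual should be at the level of floating-point round-off.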


By  Lemma  8.1,  it  seems  that  edge  deletion  (Prob.  8.1)  can  be  transformed  to  the  node 
deletion  problem  on  the  line  graph  -  that  is,  select  a  subset  of  k  nodes  from  the  line 
graph  L(A)  whose  deletion  creates  the  largest  decrease  in  terms  of  the  leading  eigenvalue 
of  L(A).  However,  by  the  following  lemma,  the  node  deletion  problem  itself  is  still  a 
challenging  task. 

Lemma  8.2.  Hardness  of  Node  Deletion.  It  is  NP-Complete  to  find  a  set  of  k  nodes  from  a 
graph  A,  whose  deletion  will  create  the  largest  decrease  of  the  largest  eigenvalue  of  the  graph  A. 

PROOF. The proof can be done by a reduction from the independent node set
problem, which is known to be NP-Complete [Kar72]. The detailed proof is omitted for
brevity. □


That  said,  we  seek  an  effective  algorithm  that  directly  solves  the  NETMELT  problem 
next. 




8.3.2  Proposed  K-EdgeDeletion  Algorithm 


The key to solving Prob. 8.1 (NetMelt) is to quantify the impact of deleting a set
of edges in terms of the leading eigenvalue λ. The naive way is to recompute the
leading eigenvalue λ after deleting the corresponding set of edges: the smaller the new
eigenvalue, the better the subset of edges. But this is computationally infeasible for
large graphs. In general, the impact of a given set of edges (in terms of decreasing the
leading eigenvalue λ) is not equal to the sum of the impacts of deleting each individual
edge, so we would need O(m) time for each of the (m choose k) possible sets.

Let u and v be the leading left eigenvector and right eigenvector of the graph A,
respectively. Intuitively, the left eigen-score u(i) and the right eigen-score v(j) (i, j =
1, ..., n) provide some importance measure for the corresponding nodes i and j. The core
idea of the proposed K-EdgeDeletion algorithm is to quantify the impact of each edge
by the corresponding left and right eigen-scores independently (step 9). Our upcoming
analysis in the next subsection shows that this strategy (1) leads to a good approximation
of the actual impact with respect to decreasing the leading eigenvalue; and (2) naturally de-couples
the dependence among the different edges. As a result, we can avoid the combinatorial
enumeration in Prob. 8.1 by picking the top-k edges with the highest individual impact
scores (steps 9 and 11).

Note that steps 2-7 in Alg. 3 ensure that all the eigen-scores (i.e., u(i), v(j), i, j =
1, ..., n) are non-negative. According to the Perron-Frobenius theorem [GL96], such
eigenvectors u and v always exist.


Algorithm 3 K-EdgeDeletion

Input: the adjacency matrix A and the budget k
Output: k edges

1: compute the leading eigenvalue λ of A; let u and v be the corresponding left and
   right eigenvectors, respectively;
2: if min_{i=1,...,n} u(i) < 0 then
3:    assign u ← -u
4: end if
5: if min_{i=1,...,n} v(i) < 0 then
6:    assign v ← -v
7: end if
8: for each edge e_x : (i_x, j_x), e_x = 1, ..., m; i_x, j_x = 1, ..., n do
9:    score(e_x) = u(i_x) v(j_x);
10: end for
11: return top-k edges with the highest score(e_x)
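
The steps of Alg. 3 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names, the edge-list input with 0-based node ids, and the use of power iteration on A + I are assumptions made here. Because the iteration starts from positive vectors, the iterates stay non-negative, so the sign flips of steps 2-7 are not needed in this sketch. The graph is assumed to be irreducible.

```python
import heapq


def leading_eigenvectors(edges, n, iters=500):
    """Leading eigenvalue with non-negative left/right eigenvectors.

    Hypothetical helper: power iteration on A + I, started from positive
    vectors; the identity shift is removed from the eigenvalue estimate.
    """
    u = [1.0 / n] * n
    v = [1.0 / n] * n
    lam = 0.0
    for _ in range(iters):
        new_u, new_v = list(u), list(v)  # the "+ I" contribution
        for i, j in edges:
            new_v[i] += v[j]             # right eigenvector: (A v)(i)
            new_u[j] += u[i]             # left eigenvector:  (u' A)(j)
        lam = sum(new_v) / sum(v) - 1.0
        su, sv = sum(new_u), sum(new_v)
        u = [x / su for x in new_u]
        v = [x / sv for x in new_v]
    return lam, u, v


def k_edge_deletion(edges, n, k):
    """Return the k edges with the highest score u(i_x) * v(j_x)."""
    _, u, v = leading_eigenvectors(edges, n)
    return heapq.nlargest(k, edges, key=lambda e: u[e[0]] * v[e[1]])


# Triangle 0-1-2 (both directions) with a pendant node 3 attached to node 2.
undirected = [(0, 1), (0, 2), (1, 2), (2, 3)]
edges = undirected + [(j, i) for i, j in undirected]
print(k_edge_deletion(edges, 4, 4))
```

In this toy graph the four directed edges between the hub node 2 and nodes 0 and 1 receive the highest scores, while the edges touching the pendant node 3 receive the lowest, which matches the intuition that edges between high eigen-score nodes matter most.
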

8.3.3 Proofs and Analysis

Here, we analyze the accuracy and the efficiency of the proposed K-EdgeDeletion
algorithm.

The accuracy of the proposed K-EdgeDeletion is summarized in Lemma 8.3. According
to Lemma 8.3, the first-order matrix perturbation theory, together with the fact
that many real graphs have a large eigen-gap, provides a good approximation to the impact
of a set of edges in terms of decreasing the leading eigenvalue. More importantly,
with such an approximation, the impacts of the different edges are now de-coupled from
each other. Therefore, we can avoid the combinatorial enumeration of Prob. 8.1 by simply
returning the top-k edges with the highest individual impact scores (steps 9 and 11 in Alg. 3).

Notice that by Lemma 8.3, there is an O(k) gap between the approximate and the
actual impact of a set of edges in terms of decreasing the leading eigenvalue. Our
experimental evaluations show that the correlation between the approximate and the
actual impact is very high (see Section 8.5 for details), indicating that it indeed provides a
good approximation for the actual decrease of the leading eigenvalue.

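Lemma 8.3. Let $\tilde{\lambda}$ be the (exact) first eigenvalue of $\tilde{\mathbf{A}}$, where $\tilde{\mathbf{A}}$ is the perturbed version of $\mathbf{A}$ obtained by
removing all of its edges indexed by the set $\mathcal{S}$. Let $\delta = \lambda - \lambda_2$ be the eigen-gap of the matrix $\mathbf{A}$,
where $\lambda_2$ is the second eigenvalue of $\mathbf{A}$, and $c = 1/(\mathbf{u}'\mathbf{v})$. If $\lambda$ is the simple first eigenvalue of $\mathbf{A}$,
and $\delta \geq 2\sqrt{k}$, then $\lambda - \tilde{\lambda} = c \sum_{e_x \in \mathcal{S}} \mathbf{u}(i_x)\mathbf{v}(j_x) + O(k)$.

PROOF. Let $\lambda_i$ ($i = 1, \ldots, n$) be the ordered eigenvalues of $\mathbf{A}$ (i.e., $|\lambda| = |\lambda_1| \geq |\lambda_2| \geq \ldots \geq |\lambda_n|$).
Let $\tilde{\lambda}_i$ ($i = 1, \ldots, n$) be the corresponding eigenvalues of $\tilde{\mathbf{A}}$. Notice that we omitted
the subscripts for the leading eigenvalues (i.e., $\lambda_1 = \lambda$, and $\tilde{\lambda}_1 = \tilde{\lambda}$).

Let $\tilde{\mathbf{A}} = \mathbf{A} + \mathbf{E}$. We have $\|\mathbf{E}\|_{Fro} = \sqrt{k}$.

According to the first-order matrix perturbation theory (p.183 [SS90]), we have

$$\begin{aligned}
\tilde{\lambda}_1 &= \lambda_1 + \frac{\mathbf{u}'\mathbf{E}\mathbf{v}}{\mathbf{u}'\mathbf{v}} + O(\|\mathbf{E}\|^2) \\
&= \lambda_1 - c \sum_{e_x \in \mathcal{S}} \mathbf{u}(i_x)\mathbf{v}(j_x) + O(k)
\end{aligned} \tag{8.4}$$

Next, we will show that $\tilde{\lambda}_1$ is indeed the leading eigenvalue of $\tilde{\mathbf{A}}$. To this end, again
by the matrix perturbation theory (p.203 [SS90]), we have

$$\begin{aligned}
\tilde{\lambda}_1 &\geq \lambda_1 - \|\mathbf{E}\|_2 \geq \lambda_1 - \|\mathbf{E}\|_{Fro} \geq \lambda_1 - \sqrt{k} \\
\tilde{\lambda}_i &\leq \lambda_i + \|\mathbf{E}\|_2 \leq \lambda_i + \|\mathbf{E}\|_{Fro} \leq \lambda_i + \sqrt{k} \quad (i \geq 2)
\end{aligned} \tag{8.5}$$

Since $\delta = \lambda_1 - \lambda_2 \geq 2\sqrt{k}$, we have $\tilde{\lambda}_1 \geq \tilde{\lambda}_i$ ($i = 2, \ldots, n$). In other words, we have that
$\tilde{\lambda}_1 = \tilde{\lambda}$ is the leading eigenvalue of $\tilde{\mathbf{A}}$. Therefore,

$$\lambda - \tilde{\lambda} = c \sum_{e_x \in \mathcal{S}} \mathbf{u}(i_x)\mathbf{v}(j_x) + O(k) \tag{8.6}$$

which completes the proof. □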
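
To make the first-order estimate of Lemma 8.3 concrete, the sketch below compares the approximate decrease c * sum of u(i_x) v(j_x) with the decrease obtained by recomputing the leading eigenvalue. It is an illustration under assumptions: the helper names are hypothetical, eigenpairs come from power iteration on A + I, and the example graph (a complete directed graph on 30 nodes) is chosen here only because its eigen-gap is large.

```python
def leading_eigenvectors(edges, n, iters=200):
    """Power iteration on A + I from positive starts (hypothetical helper)."""
    u = [1.0 / n] * n
    v = [1.0 / n] * n
    lam = 0.0
    for _ in range(iters):
        new_u, new_v = list(u), list(v)
        for i, j in edges:
            new_v[i] += v[j]
            new_u[j] += u[i]
        lam = sum(new_v) / sum(v) - 1.0  # undo the identity shift
        su, sv = sum(new_u), sum(new_v)
        u = [x / su for x in new_u]
        v = [x / sv for x in new_v]
    return lam, u, v


def approx_decrease(edges, n, removed):
    """First-order estimate: c times the sum of u(i_x) v(j_x) over removed edges."""
    _, u, v = leading_eigenvectors(edges, n)
    c = 1.0 / sum(ui * vi for ui, vi in zip(u, v))
    return c * sum(u[i] * v[j] for i, j in removed)


def actual_decrease(edges, n, removed):
    """Exact decrease, obtained by recomputing the leading eigenvalue."""
    lam, _, _ = leading_eigenvectors(edges, n)
    gone = set(removed)
    kept = [e for e in edges if e not in gone]
    lam_new, _, _ = leading_eigenvectors(kept, n)
    return lam - lam_new


# Complete directed graph on 30 nodes (no self-loops); delete two edges.
n = 30
edges = [(i, j) for i in range(n) for j in range(n) if i != j]
removed = [(0, 1), (2, 3)]
print(approx_decrease(edges, n, removed), actual_decrease(edges, n, removed))
```

On this graph the eigenvectors are uniform, so the first-order estimate is 2/n, and the recomputed decrease should agree with it closely because the eigen-gap is much larger than the perturbation.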

The efficiency of the proposed K-EdgeDeletion is summarized in the following
lemma, which says that with a fixed budget k, K-EdgeDeletion is linear with respect to the size of
the graph for both time and space cost.

Lemma 8.4. Efficiency of K-EdgeDeletion. The time cost of Alg. 3 is O(mk + n). The
space cost of Alg. 3 is O(n + m + k).

PROOF. Using the power method, step 1 takes O(m) time. Steps 2-7 take O(n)
time. Steps 8-10 take O(m) time. Step 11 takes O(mk) time. Therefore, the overall time
complexity of Alg. 3 is O(mk + n), which completes the proof of the time cost.

We need O(m) to store the original graph A. It takes O(n) and O(1) to store the
eigenvectors and eigenvalue, respectively. We need an additional O(m) to store the scores (step
9) for all the edges. Finally, it takes O(k) for the selected k edges. Therefore, the overall
space complexity of Alg. 3 is O(n + m + k), which completes the proof of the space cost. □


8.4  Proposed  Algorithm  for  NetGel 

In this section, we address the NetGel problem (Prob. 8.2), where we want to add a set
of new links into the graph A so that its leading eigenvalue λ will increase as much as
possible. We first present the proposed K-EdgeAddition algorithm, and then analyze
its accuracy as well as efficiency.

8.4.1  Proposed  K-EdgeAddition  Algorithm 

Let $\mathcal{T}$ be a set of non-existing edges in A, that is, for each $e_x : (i_x, j_x) \in \mathcal{T}$, we have
$\mathbf{A}(i_x, j_x) = 0$. Let $\tilde{\lambda}$ be the leading eigenvalue of the new adjacency matrix $\tilde{\mathbf{A}}$ obtained by
introducing the new edges indexed by the set $\mathcal{T}$. By a procedure similar to the proof of
Lemma 8.3, we can show that the impact of the new set of edges $\mathcal{T}$ in terms of increasing
the leading eigenvalue, $\tilde{\lambda} - \lambda$, can be approximated (up to the positive constant c of Lemma 8.3,
which does not affect the ranking of candidate edges) as

$$\tilde{\lambda} - \lambda \approx \sum_{e_x \in \mathcal{T}} \mathbf{u}(i_x)\mathbf{v}(j_x) \tag{8.7}$$

Therefore, it seems that we could use a procedure similar to K-EdgeDeletion
to solve the NetGel problem (referred to as 'Naive-Add'): for each non-existing edge
e_x : (i_x, j_x), calculate its score as score(e_x) = u(i_x)v(j_x), and pick the top-k non-existing
edges with the highest scores.

However, many real graphs are very sparse, i.e., m << n^2. Therefore, we have
O(n^2 - m) ≈ O(n^2) possible non-existing edges. In other words, Naive-Add requires
quasi-quadratic time with respect to the number of nodes (n) in the graph, which does not scale
to large graphs.

To address this issue, we propose an efficient algorithm, which is summarized in
Alg. 4. The core idea of K-EdgeAddition is to prune a large portion of the non-existing
edge pairs based on their left and right eigen-scores. As in Alg. 3, we take the same
procedure to make sure that the left and right eigenvectors (u, v) are non-negative. We
omit these steps in Alg. 4 for brevity.

Algorithm 4 K-EdgeAddition

Input: the adjacency matrix A and the budget k
Output: k non-existing edges

1: compute the left (u) and right (v) eigenvectors of A that correspond to the leading
   eigenvalue (u, v ≥ 0);
2: calculate the maximum in-degree (d_in) and out-degree (d_out) of A, respectively;
3: find the subset of k + d_in nodes with the highest left eigen-scores u(i). Index them by I;
4: find the subset of k + d_out nodes with the highest right eigen-scores v(j). Index them
   by J;
5: for each edge e_x : (i_x, j_x), i_x ∈ I, j_x ∈ J, A(i_x, j_x) = 0 do
6:    score(e_x) = u(i_x) v(j_x). Index them by P;
7: end for
8: return top-k non-existing edges with the highest scores among P.


8.4.2  Proofs  and  Analysis 

Here,  we  analyze  the  accuracy  and  efficiency  of  the  proposed  K-EdgeAddition. 

The  accuracy  of  the  proposed  K-EDGEADDITION  is  summarized  in  Lemma  8.5,  which 
says  that  K-EdgeAddition  selects  the  same  set  of  edges  as  Naive-Add. 

Lemma 8.5. Effectiveness of K-EdgeAddition. Alg. 4 outputs the same set of non-existing
edges as Naive-Add.

PROOF.  Omitted  for  brevity.  □ 
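
Since the proof is omitted, the following sketch only checks the claim empirically and should not be read as the authors' argument. One way to see why the pruning is plausible: for a fixed target j, at most d_in of the k + d_in highest-scoring source nodes can already link to j, so at least k non-existing pairs inside the pruned sets score at least as high as any pair whose source lies outside them (and symmetrically for targets). Ties aside, the top-k pairs are therefore found inside the pruned sets. The function names, the power iteration on A + I, and the example graph are assumptions of this sketch. Following the condition in step 5 of Alg. 4, a pair (i, i) with A(i, i) = 0 is treated as a non-existing edge here, which is also an assumption.

```python
import heapq


def leading_eigenvectors(edges, n, iters=500):
    """Power iteration on A + I from positive starts (hypothetical helper)."""
    u = [1.0 / n] * n
    v = [1.0 / n] * n
    for _ in range(iters):
        new_u, new_v = list(u), list(v)
        for i, j in edges:
            new_v[i] += v[j]
            new_u[j] += u[i]
        su, sv = sum(new_u), sum(new_v)
        u = [x / su for x in new_u]
        v = [x / sv for x in new_v]
    return u, v


def top_k_by_score(candidates, u, v, k):
    return heapq.nlargest(k, candidates, key=lambda e: u[e[0]] * v[e[1]])


def naive_add(edges, n, k):
    """Score every non-existing pair: O(n^2) candidates."""
    u, v = leading_eigenvectors(edges, n)
    existing = set(edges)
    candidates = [(i, j) for i in range(n) for j in range(n)
                  if (i, j) not in existing]
    return top_k_by_score(candidates, u, v, k)


def k_edge_addition(edges, n, k):
    """Score only pairs inside the pruned node sets of Alg. 4."""
    u, v = leading_eigenvectors(edges, n)
    existing = set(edges)
    in_deg, out_deg = [0] * n, [0] * n
    for i, j in edges:
        out_deg[i] += 1
        in_deg[j] += 1
    d_in, d_out = max(in_deg), max(out_deg)
    top_left = heapq.nlargest(k + d_in, range(n), key=lambda i: u[i])
    top_right = heapq.nlargest(k + d_out, range(n), key=lambda j: v[j])
    candidates = [(i, j) for i in top_left for j in top_right
                  if (i, j) not in existing]
    return top_k_by_score(candidates, u, v, k)


# A 10-node ring with a few chords (illustrative data only).
n = 10
edges = [(i, (i + 1) % n) for i in range(n)]
edges += [(0, 5), (5, 0), (2, 7), (7, 3), (3, 0), (0, 3)]
print(k_edge_addition(edges, n, 3))
print(naive_add(edges, n, 3))
```

The pruned variant scores at most (k + d_in)(k + d_out) pairs instead of roughly n^2, which is where the savings stated in Lemma 8.6 come from.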

The efficiency of the proposed K-EdgeAddition is summarized in the following
lemma.

Lemma 8.6. Efficiency of K-EdgeAddition. The time cost of Alg. 4 is O(m + nt + kt^2).
The space cost of Alg. 4 is O(n + m + t^2), where t = max{k, d_in, d_out}.

PROOF. Using the power method, step 1 takes O(m) time. Step 2 takes O(m + n) time.
Steps 3-4 take O(n(d_in + k)) and O(n(d_out + k)) time respectively, both of which can
be written as O(nt). Steps 5-7 take O((k + d_in)(k + d_out)) = O(t^2) time. Step 8 takes
O((k + d_in)(k + d_out)k) = O(kt^2) time. Therefore, the overall time cost is O(m + nt + kt^2),
which completes the proof of the time complexity.

We need O(m) to store the original graph A. It takes O(n) to store the eigenvectors
u and v. Step 2 takes additional O(n + 1) space. Steps 3-4 take O(d_in + k) and
O(d_out + k) space respectively, both of which can be simplified as O(t). Steps 5-7 take
at most O((k + d_in)(k + d_out)) = O(t^2) space. Step 8 takes O(k) space. Therefore, the
overall space cost (by omitting the smaller terms) is O(n + m + t^2), which completes
the proof of the space complexity. □


8.5  Experimental  Evaluations 


Dataset      n        m
Oregon-A     633      2,172
Oregon-B     1,503    5,620
Oregon-C     2,504    9,446
Oregon-D     2,854    9,864
Oregon-E     3,995    15,420
Oregon-F     5,296    20,194
Oregon-G     7,352    31,330
Oregon-H     10,860   46,818
Oregon-I     13,947   61,168

Table 8.2: Dataset summary.


Dataset      k = 10   k = 50   k = 100   k = 500   k = 1000
Oregon-A     0.999    0.997    0.995     0.973     0.924
Oregon-B     0.999    0.999    0.998     0.993     0.988
Oregon-C     1.000    0.999    0.999     0.996     0.991
Oregon-D     0.999    0.999    0.999     0.994     0.988
Oregon-E     1.000    0.999    0.999     0.998     0.995
Oregon-F     1.000    0.999    0.999     0.998     0.997
Oregon-G     1.000    0.999    0.999     0.999     0.998
Oregon-H     1.000    1.000    0.999     0.999     0.999
Oregon-I     1.000    1.000    0.999     0.999     0.999

Table 8.3: Evaluations of the approximation quality. Larger is better.




In  this  section,  we  provide  empirical  evaluations  for  the  proposed  K-EdgeDeletion 
and  K-EdgeAddition  algorithms.  Our  evaluations  mainly  focus  on  (1)  the  effectiveness 
and  (2)  the  efficiency  of  the  proposed  algorithms. 

8.5.1  Experimental  Setup 

Data sets. We used a popular set of real graphs for our experiments: the Oregon AS
(Autonomous System) router graphs, which are AS-level connectivity networks inferred
from Oregon route-views (footnote 2). These were collected once a week, for 9 consecutive weeks.
Table 8.2 summarizes the nine graphs we used in our evaluations.

Evaluation criteria. As mentioned before, the leading eigenvalue λ of the graph is
the only graph parameter that determines the epidemic threshold for a large family of
information dissemination processes. Therefore, we report the change of the leading
eigenvalue for the effectiveness comparison, for both the NetMelt and NetGel problems.
A larger change of the leading eigenvalue is better, since it suggests that we can affect
the outcome of the dissemination process more. In addition, we also run virus propagation
simulations to compare how different methods affect the actual outcome of the
propagation. For the computational cost and scalability, we report the wall-clock time.

Machine  configurations.  All  the  experiments  ran  on  the  same  machine  with  four  2.4GHz 
AMD  CPUs  and  48GB  memory,  running  Linux  (2.6  kernel). 


(a)  Oregon-A  (b)  Oregon-B  (c)  Oregon-C 

Figure  8.2:  The  decrease  of  the  leading  eigenvalue  vs.  the  budget  k.  Larger  is  better.  The  proposed 
K-EDGEDELETION  always  leads  to  the  biggest  decrease  of  the  leading  eigenvalue. 


8.5.2  Effectiveness  of  K-EdgeDeletion 

Approximation Quality. For both K-EdgeDeletion and K-EdgeAddition, we
approximate the actual change of the leading eigenvalue by first-order matrix
perturbation theory. This is the only place where we introduce an approximation. Lemma 8.3
says that the quality of such an approximation depends on both the budget k and
the eigen-gap of the original graph, with an O(k) gap. Here, we experimentally evaluate
how good this approximation is on real graphs. We compute the linear correlation
coefficient between the actual and approximate leading eigenvalue after we randomly
remove k (k = 10, 50, 100, 500, 1000) edges. The results are shown in Table 8.3. It can
be seen that the approximation is very good: in all cases, the linear correlation
coefficient is greater than 0.92, and often it is very close to 1.

2 http://topology.eecs.umich.edu/data.html

Figure 8.3: Comparison of minimizing the outcome of the virus propagation. Fraction of infected
nodes vs. time stamp. Lower is better. The proposed K-EdgeDeletion always leads to the least
number of infected nodes. Notice that the y-axis is in logarithmic scale.

The Impact of Decreasing the Leading Eigenvalue. Here, we evaluate the effectiveness
of the proposed K-EdgeDeletion in terms of decreasing the leading eigenvalue λ of
the graph. Lemma 8.1 suggests that the 'important' edges on the original graph A might
become 'important' nodes on the line graph L(A). We follow this intuition to design
the following comparative strategies: (1) randomly select k edges from the original
graph A (referred to as 'Rand'); (2) select k edges with the highest degrees in the line
graph L(A) (referred to as 'Line-Deg'); (3) select k edges with the highest eigen-scores
in the line graph L(A) (referred to as 'Line-Eig'); and (4) select k edges with the highest
PageRank scores in the line graph L(A) (referred to as 'Line-Page'). For 'Rand', we run
the experiments 100 times and report the average result. For 'Line-Deg', we have two
variants, using out-degree or in-degree. In our evaluation, we found that these two
variants give similar results. Therefore, we only report the results by out-degree.
For the same reason, we only report the results by the right eigen-scores for 'Line-Eig'.
For 'Line-Page', there is an additional parameter, the teleport probability. We run the
experiments with different teleport probabilities and report the best results.

For brevity, we only present the results on Oregon-A, Oregon-B and Oregon-C since
the results on the remaining six graphs are similar. From Fig. 8.2, it can be seen that our
K-EdgeDeletion always leads to the biggest decrease in terms of the leading eigenvalue.
For example, on the Oregon-C graph, the proposed K-EdgeDeletion decreases the leading
eigenvalue by 3.8 with the budget k = 50, which is almost double that of the second best
method (e.g., 2.0 by 'Line-Deg'). Therefore, we expect that K-EdgeDeletion would
affect the outcome of the dissemination processes better than the alternative choices, e.g.,
leaving fewer infected nodes in the graph. We validate this next.


Affecting Virus Propagation. Next, we evaluate the effectiveness of the proposed
K-EdgeDeletion in terms of minimizing the outcome of the information dissemination
processes. To this end, we simulate the virus propagation for the SIS model
(susceptible-infective-susceptible) on the graph [WCWF03]. For each method, we delete k = 200
edges from the original graph. Let s = λ · b/d be the normalized virus strength (bigger s
means stronger virus), where b and d are the infection rate and death rate, respectively.
The results are presented in Fig. 8.3, which are averaged over 1,000 runs. It can be seen
that the proposed K-EdgeDeletion is always the best: its curve is always the lowest,
which means that we always, as desired, have the least number of infected nodes in the
graph with this strategy. In Fig. 8.3, 'Original' (the yellow curve) means that we simulate
the virus propagation on the original graph without deleting any edges. Notice that
when the virus becomes stronger (Fig. 8.3(b)), all the curves except the proposed method
mix with 'Original', which means that they all fail to affect the virus propagation in this
case. In contrast, our proposed method (the red curve) can still significantly reduce the
number of infected nodes.
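
A simulation of this kind can be sketched as follows. The text does not spell out the simulation details, so everything here is an assumption made for illustration: the function name `simulate_sis`, the discrete time steps, the rule that an infected node i infects each susceptible out-neighbor j independently with probability b per step, recovery with probability d per step, and the default of starting with every node infected. The normalized virus strength from the text corresponds to s = λ · b/d.

```python
import random


def simulate_sis(edges, n, b, d, steps, seed=0, initially_infected=None):
    """Discrete-time SIS simulation; returns the infected fraction per step.

    Assumptions of this sketch: infection travels along edge (i, j) from
    i to j with probability b per step, and every infected node recovers
    (becomes susceptible again) with probability d per step.
    """
    rng = random.Random(seed)
    out_neighbors = [[] for _ in range(n)]
    for i, j in edges:
        out_neighbors[i].append(j)

    if initially_infected is None:
        infected = set(range(n))
    else:
        infected = set(initially_infected)

    history = [len(infected) / n]
    for _ in range(steps):
        newly_infected = set()
        for i in sorted(infected):       # sorted for reproducible draws
            for j in out_neighbors[i]:
                if j not in infected and rng.random() < b:
                    newly_infected.add(j)
        recovered = {i for i in sorted(infected) if rng.random() < d}
        infected = (infected - recovered) | newly_infected
        history.append(len(infected) / n)
    return history


# Directed ring on 5 nodes, a single seed node, certain infection, no recovery.
ring = [(i, (i + 1) % 5) for i in range(5)]
print(simulate_sis(ring, 5, b=1.0, d=0.0, steps=4, initially_infected=[0]))
```

Averaging the returned histories over many seeds, once for the original edge list and once for the edge list with the selected k edges removed, gives curves of the kind compared in Fig. 8.3.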


Node Deletion vs. Edge Deletion. Finally, in some applications, e.g., stopping malware
propagation on computer networks, both node deletion (e.g., shutting down some
machines) and edge deletion (e.g., blocking some links between machines) are feasible.
In this case, we want to know which strategy (node deletion or edge deletion) is more
effective in affecting the outcome of such a propagation process. To this end, we use an
effective node immunization algorithm [TPT+10] to delete k = 1 and k = 10 nodes, respectively
(referred to as 'Node-Del'). For each k, we then use our proposed K-EdgeDeletion
to delete the same number of edges from the original graph as were removed by 'Node-Del'
(referred to as 'Edge-Del').
We compare the decrease of the leading eigenvalue achieved by the two methods. The results are
summarized in Fig. 8.4. It can be seen that 'Edge-Del' always leads to a bigger decrease
of the leading eigenvalue, which suggests that by operating at the edge level, we can
design a more effective algorithm with the same budget to affect the outcome of the
information dissemination process. The results are consistent with the intuition: not all
the edges adjacent to the 'important' nodes, which the node immunization algorithm
aims to delete, are also 'important' (e.g., many edges adjacent to an 'important' node
might link to/from some degree-1 nodes). In other words, edge deletion enables us to
optimize the underlying graph structure at a finer granularity by picking individual
edges one by one.

(a) k = 1  (b) k = 10

Figure 8.4: Comparison between node deletion vs. edge deletion. Larger is better. With the same
number of edges deleted, our proposed K-EdgeDeletion (red) leads to a bigger decrease in
terms of the leading eigenvalue.

8.5.3  Effectiveness  of  K-EdgeAddition 

To the best of our knowledge, there are no existing methods to add k new links into an existing
graph in order to increase its leading eigenvalue. Let Ā be the complementary graph of
A, which has the same node set as A, with Ā(i, j) = 1 iff A(i, j) = 0. With the notation
of the complementary graph, we use the following intuition to design the comparative
methods: select k 'important' edges from the complementary graph Ā and add them into
the original graph A. More specifically, we compare the proposed K-EdgeAddition
with the following strategies: (1) randomly select k edges (referred to as 'Rand'); (2)
select k edges with the highest out-degrees in the line graph of the complementary graph
Ā (referred to as 'CompDeg'); (3) select k edges with the highest right eigen-scores in
the line graph of the complementary graph Ā (referred to as 'CompEig'); (4) select k
edges with the highest PageRank scores in the line graph of the complementary graph
Ā (referred to as 'CompPage'); and (5) select k edges by running K-EdgeDeletion on
the complementary graph Ā (referred to as 'CompDelete'). Again, for 'Rand', we run
the experiments 100 times and report the average result. We only report the results of
'CompDeg' by out-degree and those of 'CompEig' by right eigen-scores, respectively,
since the other variants give similar performance. For 'CompPage', we run the
experiments with different teleport probabilities and report the best results.

The Impact of Increasing the Leading Eigenvalue. We first evaluate the effectiveness of
the proposed K-EdgeAddition in terms of increasing the leading eigenvalue of the
graph. For brevity, we only present the results on Oregon-A, Oregon-B and Oregon-C since
the results on the rest of the graphs are similar. From Fig. 8.5, it can be seen that the
proposed K-EdgeAddition always leads to the biggest increase in terms of the leading
eigenvalue of the graph. Notice that all the comparative methods behave like
'Rand' (blue curve), especially when the budget k is small.

Affecting Virus Propagation. We also evaluated the effectiveness of the proposed
K-EdgeAddition in terms of maximizing the outcome of the information dissemination
process. To this end, again, we simulate the virus propagation for the SIS model on
the graph. For each method, we add k = 200 new edges into the graph. Again, let
$s = \lambda \cdot b / d$ be the normalized virus strength, with a bigger s meaning a stronger virus.
Here, our goal is to increase the number of 'infected' nodes (e.g., having more people in the
social network adopt a good idea) by introducing a set of new links
into the graph. The result is presented in Fig. 8.6, which is averaged over 1,000 runs.
It can be seen that the proposed K-EdgeAddition is always the best: its curve is
always the highest, which means that we always have the largest number of 'infected'
nodes in the graph with this strategy. Notice that when the strength of the virus is weak
(Fig. 8.6(a)), all the curves except the proposed method mix with or are very close to
'Original' (yellow curve), which means that they do little to boost the outcome
of the propagation in this case. In contrast, our proposed method (the red curve) can
still significantly increase the number of 'infected' nodes. Therefore, we conclude that
our proposed K-EdgeAddition is much more effective in guiding the outcome of the
dissemination process.


(a) Oregon-A  (b) Oregon-B  (c) Oregon-C


Figure  8.5:  The  increase  of  the  leading  eigenvalue  vs.  the  budget  k.  Larger  is  better.  The  proposed 
K-EDGEADDITION  always  leads  to  the  largest  increase  of  the  leading  eigenvalue.  Notice  that 
y-axis  is  in  the  logarithmic  scale. 

(a) s = 1.1  (b) s = 1.3

Figure  8.6:  Comparison  of  maximizing  the  outcome  of  virus  propagation.  Fraction  of  'infected' 
nodes  vs.  time  stamp.  Larger  is  better.  The  proposed  K-EdgeAddition  always  leads  to  the 
largest  number  of  'infected'  nodes.  Notice  that  y-axis  is  in  the  logarithmic  scale. 


8.5.4  Scalability 

We use subsets of the largest data set Oregon-1 to evaluate the scalability of the
proposed algorithms. The results are presented in Fig. 8.7. We can see that the proposed
K-EdgeDeletion and K-EdgeAddition scale near-linearly wrt m, which
means that they are suitable for large graphs. Notice that in both cases, we also observe
a slight super-linear trend. This is due to the following two reasons: (1) for both
K-EdgeDeletion and K-EdgeAddition, we use the power method to compute the
leading eigenvalue and the corresponding eigenvectors. When m increases, the actual
number of iterations in the power method also tends to increase; (2) for K-EdgeAddition,
when m increases, the maximum degree ($\max(d_{in}, d_{out})$) also increases even though we
fix the number of nodes (n).
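
The per-iteration cost behind reason (1) can be illustrated with a minimal power-method
sketch. This is an illustration under assumptions, not the implementation used in the
experiments: the function name `power_method`, the edge-list input and the stopping rule
are hypothetical, and the sketch assumes that the iteration converges (e.g., on a strongly
connected, aperiodic graph).

```python
import math


def power_method(edges, n, iters=200, tol=1e-10):
    """Leading eigenvalue and left/right eigenvectors of a 0/1 adjacency matrix.

    edges: list of (i, j) pairs meaning A[i][j] = 1; n: number of nodes.
    Hypothetical helper for illustration only.
    """
    def iterate(pairs):
        # Repeatedly apply y = M x, where M[r][c] = 1 for each (r, c) in pairs.
        x = [1.0 / math.sqrt(n)] * n
        lam = 0.0
        for _ in range(iters):
            y = [0.0] * n
            for r, c in pairs:          # one pass over the edges: O(m) work
                y[r] += x[c]
            norm = math.sqrt(sum(val * val for val in y))
            if norm == 0.0:
                return 0.0, x
            y = [val / norm for val in y]
            converged = abs(norm - lam) < tol
            lam, x = norm, y
            if converged:
                break
        return lam, x

    lam, v = iterate(edges)                      # right eigenvector: A v = lam v
    _, u = iterate([(j, i) for i, j in edges])   # left eigenvector: A^T u = lam u
    return lam, u, v
```

Each iteration touches every edge once, so the total cost is the number of iterations times
m; if the number of iterations grows with m, the overall trend becomes slightly super-linear.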


(a)  K-EdgeDeletion  (b)  K-EdgeAddition 

Figure  8.7:  Scalability  of  proposed  algorithms.  Both  K-EDGEDELETION  and  K-EDGEADDITION 
scale  near-linearly  wrt  the  size  of  the  graph. 


8.6 Related Work

In  this  section,  we  review  the  related  work  specific  to  the  problems  discussed  in  this 
chapter.  We  have  already  reviewed  most  of  the  related  work  in  previous  chapters  (more 
specifically  see  Chapters  6  and  7). 

Affecting  Algorithms.  Note  that  all  the  previous  works  focus  on  operating  on 
the  node  level  (i.e.,  delete  or  inoculate  a  set  of  'best'  nodes)  to  affect  the  outcome  of 
the  dissemination.  In  contrast,  we  study  the  equally  important,  but  much  less  studied 
affecting  algorithms  by  operating  on  the  edge  level. 

There exist some empirical evaluations on edge removal strategies for slightly different
purposes, such as slowing down the influenza spreading [MK09], minimizing the
average infection probability [SMHH11], evaluating and comparing the attack
vulnerability [HKYH02], etc. The closest related work to our K-EdgeDeletion algorithm
is [BS11], which proposed a convex optimization based approach to approximately
minimize the leading eigenvalue of the graph. However, the method is based on semi-definite
programming and does not scale to large graphs. Moreover, for all these methods, it
remains unclear if they can be generalized to address the even more challenging NetGel
problem, where we want to add new edges to promote the information dissemination.

Measuring the Importance of Nodes and Edges. In the literature, there are many
node importance measurements, including betweenness centrality, both the one based
on the shortest path [Fre77] and the one based on random walks [New05b, KPST11],
PageRank [PBMW98], HITS [Kle98], and coreness score [MW03]. Our work is also
related to the so-called k-vital edges problem, which aims to delete a set of links from
the graph to increase the shortest path length [LSH00] or the weight of the minimum
spanning tree of the remaining graph [She95]. The k-vital edges problem itself is known to be
NP-Hard. Other remotely related work includes graph augmentation [PBG11], graph
sparsification [KMST10], network inhibition [Phi93] and network interdiction [Woo93,
IW02]. Both network inhibition and network interdiction are NP-Hard.


8.7  Conclusion 

In  this  chapter,  we  studied  the  problem  of  how  to  optimize  the  link  structure  to  affect  the 
outcome  of  information  dissemination  processes.  The  main  contributions  of  the  work 
are: 


• Algorithms. We observe that for a large family of information dissemination
processes, the problem boils down to the eigenvalue optimization problem. We propose
an effective, scalable algorithm to optimize such a key graph parameter (i.e., the
leading eigenvalue) that controls the information dissemination process, for both
NetMelt and NetGel, respectively;

• Proofs and Analysis. We show the accuracy (Lemma 3 and Lemma 5) and the
complexity of our methods (Lemma 4 and Lemma 6); the hardness of the problem
(Lemma 2), and the equivalence between the different strategies (Lemma 1, Lemma 7
and Lemma 8);

• Experimental Evaluations. Our evaluations on real large graphs show that (a)
compared with alternative choices to optimize the link structure, our methods are much
more effective in affecting the outcome of the dissemination process; (b) compared with
the node deletion strategy, our K-EdgeDeletion offers a more effective way by
operating on the edge level; and (c) both K-EdgeDeletion and K-EdgeAddition
scale to large graphs.


APPENDIX 

Higher-Order NetMelt. From Lemma 8.3, it can be seen that the only place we
introduce an approximation in Alg. 3 is to approximate the actual decrease of the leading
eigenvalue by first-order matrix perturbation theory. Readers might wonder if
we can further improve the quality by using higher-order matrix perturbation theory,
while maintaining the linear scalability of the algorithm.

We explored second-order matrix perturbation theory to approximate the actual
decrease of the leading eigenvalue, and found that (1) it generates very similar results
to the proposed K-EdgeDeletion algorithm and (2) it requires 5-10x more wall-clock
time. The reason might be that for the NetMelt problem, the first-order perturbation
already gives a very good approximation. Therefore, in practice, we recommend
K-EdgeDeletion for simplicity.

Nonetheless, the new algorithm based on the second-order perturbation exhibits
some interesting theoretical properties. It also helps to understand the relationship between
edge deletion and node deletion on the algorithmic level. We present it here for
completeness.

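Let $c = 1/(u^T v)$. With second-order matrix perturbation, we can approximate³ the impact
of deleting a set of edges $S$ in terms of the leading eigenvalue as:

$$\lambda - \tilde{\lambda} \approx \mathrm{Impact}(S) = c \Big( \sum_{e_x \in S} u(i_x) v(j_x) - \frac{1}{2\lambda} \sum_{e_x, e_y \in S,\; j_x = i_y} u(i_x) v(j_y) \Big) \qquad (8.8)$$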

Compared with the first-order perturbation (eq. (8.6)), we have an additional penalty
term in eq. (8.8): $u(i_x) v(j_y)$ for any two adjacent edges $e_x$ and $e_y$. The intuition is to
encourage the edges in the set $S$ to be far away from (not adjacent to) each other.

By eq. (8.8), the impacts of different edges in the set $S$ are no longer independent of
each other. At first glance, this might complicate the algorithm since now we need to
optimize at the set level, that is, to find a set of edges that collectively maximize eq. (8.8).
However, by the following lemma, the impact defined in eq. (8.8) exhibits some nice
diminishing-returns properties.

³ This formula is similar to the one in [MSN10].

Lemma 8.7. Second-Order Approximation Properties. The $\mathrm{Impact}(S)$ defined in eq. (8.8)
has the following properties:

(1) $\mathrm{Impact}(\emptyset) = 0$, where $\emptyset$ is the empty set;

(2) $\mathrm{Impact}(S)$ is monotonically non-decreasing wrt the set $S$;

(3) $\mathrm{Impact}(S)$ is sub-modular wrt the set $S$.

Proof.  Easy  to  check.  □ 

These diminishing-returns properties naturally lead to the following
greedy algorithm (K-EdgeDeletion++) to find a near-optimal subset of edges to delete
from the original graph A. It can be shown that the overall time complexity of
K-EdgeDeletion++ remains linear wrt the size of the graph.

Algorithm 5 K-EdgeDeletion++

Input: the adjacency matrix A and the budget k
Output: k edges indexed by set $S$

 1: compute the leading eigenvalue $\lambda$ of A; compute the corresponding left and right
    eigenvectors u and v ($u, v \geq 0$), respectively;
 2: initialize the set $S$ to be empty;
 3: $\mathrm{score}(e_x) = u(i_x) v(j_x)$  ($e_x : (i_x, j_x)$, $x = 1, \ldots, m$);
 4: for $k_0 = 1, \ldots, k$ do
 5:   find $e_0 = \arg\max_{e_x \notin S} \mathrm{score}(e_x)$;
 6:   add the new edge $e_0 : (i_0, j_0)$ into $S$;
 7:   for each edge $e_y : (i_y, j_y)$ s.t. $j_y = i_0$ do
 8:     $\mathrm{score}(e_y) \leftarrow \mathrm{score}(e_y) - \frac{1}{2\lambda} u(i_y) v(j_0)$;
 9:   end for
10:   for each edge $e_y : (i_y, j_y)$ s.t. $i_y = j_0$ do
11:     $\mathrm{score}(e_y) \leftarrow \mathrm{score}(e_y) - \frac{1}{2\lambda} u(i_0) v(j_y)$;
12:   end for
13: end for
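
The steps of Alg. 5 can be sketched in Python as follows. This is a minimal sketch under
assumptions rather than the implementation evaluated in this chapter: the function name
`k_edge_deletion_pp` and the edge-list representation are hypothetical, the eigen-pair
(u, v, lam) is assumed to be computed beforehand by any eigen-solver, and the arg-max is a
plain linear scan (a priority queue would be needed to approach the linear running time
claimed above).

```python
def k_edge_deletion_pp(edges, u, v, lam, k):
    """Greedy edge selection following the steps of Alg. 5 (illustrative sketch).

    edges: list of directed edges (i, j); u, v: left/right leading eigenvectors;
    lam: leading eigenvalue; k: budget. Returns indices into `edges`.
    """
    score = [u[i] * v[j] for i, j in edges]       # step 3: first-order scores
    # Index edges by endpoint so that each update touches only adjacent edges.
    by_tail, by_head = {}, {}
    for idx, (i, j) in enumerate(edges):
        by_tail.setdefault(i, []).append(idx)     # edges leaving node i
        by_head.setdefault(j, []).append(idx)     # edges entering node j
    chosen, in_set = [], set()
    for _ in range(min(k, len(edges))):
        candidates = (x for x in range(len(edges)) if x not in in_set)
        e0 = max(candidates, key=lambda x: score[x])   # step 5
        chosen.append(e0)                              # step 6
        in_set.add(e0)
        i0, j0 = edges[e0]
        for y in by_head.get(i0, []):             # steps 7-9: edges with j_y = i_0
            iy, _ = edges[y]
            score[y] -= u[iy] * v[j0] / (2.0 * lam)
        for y in by_tail.get(j0, []):             # steps 10-12: edges with i_y = j_0
            _, jy = edges[y]
            score[y] -= u[i0] * v[jy] / (2.0 * lam)
    return chosen
```

On a directed path with three edges, picking the middle edge first penalizes both of its
neighbors, so the second pick is decided by the penalized scores rather than by the original
first-order scores.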


An  interesting  property  of  Alg.  5  is  that  it  builds  the  equivalence  between  edge 
deletion  and  node  deletion  on  the  algorithmic  level: 

Lemma 8.8. Equivalence of Alg. 5 to Node Immunization. Let $S$ be the set of edges selected by
running Alg. 5 on graph A; let $T$ be the set of edges selected by running the node immunization
algorithm [TPT+10] on the line graph L(A); and let $|S| = |T|$. Then $S = T$.
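
For readers who want to experiment with Lemma 8.8, the line graph can be built directly from
an edge list. The sketch below assumes the usual definition of the directed line graph (each
edge of A becomes a node, with an arc from $e_x$ to $e_y$ whenever the head of $e_x$ is the
tail of $e_y$); the function name `line_graph` is hypothetical.

```python
def line_graph(edges):
    """Directed line graph of an edge list (illustrative sketch).

    Node x of the line graph is edge e_x = (i_x, j_x) of the original graph;
    there is an arc x -> y iff j_x == i_y.
    """
    by_tail = {}
    for y, (iy, _) in enumerate(edges):
        by_tail.setdefault(iy, []).append(y)      # edges leaving node i_y
    return [(x, y)
            for x, (_, jx) in enumerate(edges)
            for y in by_tail.get(jx, [])]
```

Node x of the returned graph corresponds to `edges[x]`, so a node selected on the line graph
maps back to an edge of the original graph.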


Chapter 9

Finding  Culprits 


Given a snapshot of a large graph, in which an infection has been spreading for some
time, can we identify those nodes from which the infection started to spread? In other
words, can we reliably tell who the culprits are? In this chapter, we answer this
question affirmatively, and give an efficient method called NetSleuth for the well-known
Susceptible-Infected virus propagation model.

Essentially, we are after that set of seed nodes that best explains the given snapshot.
We propose to employ the Minimum Description Length principle to identify the best set
of seed nodes and virus propagation ripple, as the one by which we can most succinctly
describe the infected graph. We give a highly efficient algorithm to identify likely sets
of seed nodes given a snapshot. Then, given these seed nodes, we show we can optimize
the virus propagation ripple in a principled way by maximizing likelihood. With all three
combined, NetSleuth can automatically identify the correct number of seed nodes, as
well as which nodes are the culprits.

Experimentation  on  our  method  shows  high  accuracy  in  the  detection  of  seed  nodes, 
in  addition  to  the  correct  automatic  identification  of  their  number.  Moreover,  we  show 
NETSLEUTH  scales  linearly  in  the  number  of  nodes  of  the  graph. 


9.1  Introduction 

We focus on a different and difficult question in this chapter: Given a single snapshot
of a partly infected network, how can we reliably identify those nodes from which the
epidemic started, whether for inoculation to prevent future epidemics, or for exploitation
for viral marketing?

As such, given a snapshot of a large graph $G(V, E)$ in which a subset of nodes $V' \subset V$
is currently infected, and assuming the Susceptible-Infected (SI) propagation model, we
consider the problem of how to efficiently and reliably find those seed nodes $S \subseteq V'$ from
which the epidemic started, without requiring the user to choose the number of seed nodes
in advance. In other words, we address the questions: How many culprits are there, and
who are they?

Figure 9.1: Example: Culprits, how many, and which ones? A snapshot of a 2D grid in which an
infection has been stochastically spreading. Grey circles are infected nodes, while grey dots
are un-infected. The 2 blue stars denote the true seeds. The 2 red diamonds denote the seeds
automatically discovered by NetSleuth, which are correct both in number (two) and in location
(being spatially very close to the true seeds).


We propose to employ the Minimum Description Length (MDL) principle [Gr7] to
identify that set of seed nodes, and that virus propagation ripple starting from those
nodes, that best describes the given snapshot. We give a highly efficient algorithm to
identify likely seed nodes, and show we can easily optimize the description length of
the virus propagation ripple for a given seed set by greedily maximizing likelihood. As
such, we can identify the best set of seed nodes in a principled manner, without having
to choose k, the number of seed nodes, in advance.

As  an  example,  consider  Figure  9.1.  It  depicts  an  example  grid-structured  graph,  in 
which  a  subgraph  has  been  infected  by  a  stochastic  process  starting  from  two  seed  nodes. 
The  plot  shows  the  true  seed  nodes,  as  well  as  the  seed  nodes  automatically  identified  by 
NetSleuth;  it  finds  the  correct  number  of  seed  nodes,  and  places  these  where  a  human 
would;  in  fact,  the  discovered  seeds  have  a  higher  likelihood  for  generating  this  infected 
subgraph  than  the  true  seed  nodes. 

We develop a two-step approach: we first find high-quality seeds given the number
of seeds, and then use our carefully designed MDL score to pinpoint the true number
of seeds. For the first part, we use the notion of 'exoneration' from the un-infected
frontier: e.g., in Figure 9.1 the nodes on the edge of the infected snapshot are unlikely to
be the culprits due to the large number of un-infected nodes surrounding them. Based
on this idea, we develop a novel 'submatrix-laplacian' method to find the best seed
sets given a number of seeds (see Section 9.4 for more details). Given these seed sets, we
also give an efficient algorithm to compute the MDL scores, thus finding the number of
seeds in a parameter-free way.

Table 9.1: Comparison between three culprit-identifying methods: NetSleuth,
Rumor-centrality [SZ11], and Effectors [LTGM10]

Method                     Infection model   k > 1   Automatically determines k   O(·)†
NetSleuth (our method)     SI                ✓       ✓                            Linear
Rumor-centrality [SZ11]    SI                -       -                            Quadratic
Effectors [LTGM10]         IC                ✓       -                            Quadratic

† Running time given for arbitrary graphs.

Although network infection models have been researched extensively, identifying
the seed nodes of an epidemic is surprisingly understudied. We are, however, not the
first to research this problem. Recently, Shah and Zaman [SZ10, SZ11] developed
rumor-centrality for identifying the single source node of an epidemic. In contrast, we allow for
multiple seed nodes, and automatically determine their number. Lappas et al. [LTGM10]
studied the 'Effectors' problem of identifying k seed nodes in a steady-state network
snapshot, under the Independent Cascade (IC) model. In contrast, we study the SI model,
allow snapshots from any time during the epidemic, and our approach is
parameter-free, as by MDL we can automatically identify the best value for k. Furthermore, and
very importantly for large graphs, our method is computationally much
more efficient. Table 9.1 gives a comparison of NetSleuth to these methods. We discuss
related work in more detail in Section 9.6.

Experimentation shows that NetSleuth detects seed nodes and automatically
identifies their number, both with high accuracy. With synthetic data we show it can handle
difficult fringe cases, and is in agreement with human intuition. We show we reliably
identify the correct number of seed nodes on real data, and also that our detected seeds
are of very high quality (measured by multiple metrics). Finally, we show our method
scales linearly with the number of edges of the graph.

The  rest  of  the  chapter  is  organized  in  the  typical  way:  preliminaries,  our  problem 
formulation  and  method,  experiments,  related  work  and  then  conclusion. 


9.2  Preliminaries 

In  this  section  we  give  notation,  and  introduce  MDL  and  the  infection  spreading  model 
we  use. 

9.2.1 Notation


Table 9.2 gives some of the notation and symbols we will be using in this work. We
consider undirected, unweighted graphs $G = (V, E)$ of $N = |V|$ nodes. All logarithms in
this chapter are to base 2, and we adopt the standard convention that $0 \log 0 = 0$. We
denote the transpose of any matrix or vector $V$ as $V^T$. Finally, note that $L_A$ is a submatrix
of $L(G)$, not the laplacian matrix of $G_I$.


Table 9.2: Terms and Symbols

Symbol               Definition and Description
SI model             Susceptible-Infected model
$\beta$              attack probability of the virus in the SI model
$G = (V, E)$         graph under consideration
$G_I = (V_I, E_I)$   given infected subgraph of G
$R$                  ripple, list of sets of nodes describing how the virus propagates
$N$                  $|V|$, number of nodes in graph G
$N_I$                $|V_I|$, number of nodes in graph $G_I$
$d(i)$               degree of node i
$\mathcal{F}$        set of un-infected nodes having at least one infected neighbor (in $V_I$)
$E_{\mathcal{F}}$    set of edges connecting nodes in $\mathcal{F}$ to $V_I$
$A(G)$               adjacency matrix of graph G (size $N \times N$)
$A$                  adjacency matrix of $G_I$ (size $N_I \times N_I$)
$D(G)$               diagonal degree matrix of graph G
$L(G)$               laplacian matrix of G, i.e., $L(G) = D(G) - A(G)$
$L_A$                submatrix (size $N_I \times N_I$) of $L(G)$ corresponding to the infected graph $G_I$
$Q_{MDL}$            MDL-based culprits quality measure (see § 9.5)
$Q_{JD}$             set-Jaccard-distance-based culprits quality measure (see § 9.5)

9.2.2  The  Susceptible-Infected  Model 

The most basic epidemic model is the so-called 'Susceptible-Infected' (SI) model [AM91].
Each object/node in the underlying graph is in one of two states: Susceptible (S) or
Infected (I). Once infected, each node stays infected forever. Each infected node tries to
infect each of its neighbors independently with probability $\beta$ in each discrete time-step,
where $\beta$ reflects the strength of the virus.

Note that here $1/\beta$ defines a natural time-scale (intuitively it is the expected number
of time-steps for a successful attack over an edge). As an example, if we assume that the
underlying network is a clique of N nodes, under continuous time, the model can be
written as $\frac{dI(t)}{dt} = \beta (N - I(t)) I(t)$, where $I(t)$ is the number of infected nodes at time
t. The solution is the logistic function, and it depends on $\beta$ and t only through the
product $\beta \times t$.
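
The discrete-time SI process described above can be simulated in a few lines. The sketch
below is only an illustration: the function name `simulate_si`, the adjacency-dict input and
the returned per-step history are assumptions made for this example, not part of the model
definition.

```python
import random


def simulate_si(adj, seeds, beta, steps, rng=None):
    """Discrete-time SI simulation (illustrative sketch).

    adj: dict node -> iterable of neighbors (undirected graph);
    seeds: initially infected nodes; beta: per-edge attack probability.
    Returns a list of sets: the nodes newly infected at t = 0, 1, ..., steps.
    """
    rng = rng or random.Random()
    infected = set(seeds)
    history = [set(seeds)]
    for _ in range(steps):
        newly = set()
        for i in infected:
            for j in adj[i]:
                # Each infected node attacks each susceptible neighbor independently.
                if j not in infected and rng.random() < beta:
                    newly.add(j)
        infected |= newly          # infected nodes stay infected forever
        history.append(newly)
    return history
```

For instance, `simulate_si({0: [1], 1: [0, 2], 2: [1]}, {0}, 0.3, 10)` returns one set of
newly infected nodes per time-step for a 3-node path.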

9.2.3 Minimum Description Length Principle

The Minimum Description Length principle (MDL) [Gr7] is a practical version of
Kolmogorov Complexity [LV93]. Both embrace the slogan Induction by Compression. For
MDL, this can be roughly described as follows.

Given a set of models $\mathcal{M}$, the best model $M \in \mathcal{M}$ is the one that minimizes
$\mathcal{L}(M) + \mathcal{L}(D \mid M)$, in which $\mathcal{L}(M)$ is the length in bits of the description of M,
and $\mathcal{L}(D \mid M)$ is the length of the description of the data encoded with M.

This is called two-part MDL, or crude MDL, as opposed to refined MDL, where model
and data are encoded together [Gr7]. We use two-part MDL because we are specifically
interested in the model: the seed nodes and ripple that give the best description. Further,
although refined MDL has stronger theoretical foundations, it cannot be computed except
for some special cases. Note that MDL requires the compression to be lossless in order to
allow for fair comparison between different $M \in \mathcal{M}$.

To use MDL, we have to define what our models $\mathcal{M}$ are, how an $M \in \mathcal{M}$ describes
the data at hand, and how we encode this all in bits. Note that in MDL we are only
concerned with code lengths, not actual code words.


9.3  Our  Problem  Formulation 

Next  we  formulate  our  problem  in  terms  of  MDL.  Our  goal  is  to  obtain  the  most  succinct 
explanation  of  'what  happened'.  To  do  so,  we  require  two  ingredients:  the  first  is  a 
formal  objective — a  cost  function — which  we  discuss  in  this  section.  The  second  is  then 
an  algorithm  to  find  good  solutions,  which  we  give  in  Section  9.4. 

Our  cost  function  will  consist  of  two  parts,  1)  scoring  the  seed  set  (Model  cost)  and  2) 
scoring  the  successive  infected  nodes  starting  from  the  seed  (Data  cost). 

We assume that both sender and receiver know the layout of $G = (V, E)$, but not
which nodes are in $G_I = (V_I, E_I)$. In the general formulation of MDL in
Section 9.2.3, this makes $G_I$ the data $D$ that we want to describe using our models $\mathcal{M}$.
As such, informally, our goal is to identify those nodes, and an infection propagation ripple
starting from those nodes, by which $G_I$ can most easily be described.

9.3.1  Cost  of  the  Model 

As our models we consider seed sets. A seed set $S \subseteq V_I$ is a subset of $|S|$ nodes of $G_I$ from
which the infection starts spreading: the 'patients zero', so to speak. We denote by $\mathcal{L}(S)$
the encoded length, in bits, of a seed set $S$.

To describe a seed set $S$, we first have to encode how many nodes $S$ contains. This
number, $|S|$, is upper-bounded by the number of nodes in G. Hence, by using
straightforward block-encoding we can encode $|S|$ in $\log N$ bits, by which we spend equally
many bits to encode either a small or a large number. In general, however, we favor small
seed sets: simple explanations. The MDL optimal Universal code for integers [Ris83] is
therefore a better choice as it rewards smaller seed sets by requiring fewer bits to encode
their size. With this encoding, $\mathcal{L}_{\mathbb{N}}$, the number of bits to encode an integer $n \geq 1$ is
defined as $\mathcal{L}_{\mathbb{N}}(n) = \log^*(n) + \log(c_0)$, where $\log^*$ is defined as
$\log^*(n) = \log(n) + \log\log(n) + \cdots$, where only the positive terms are included. To make
$\mathcal{L}_{\mathbb{N}}$ a valid encoding, $c_0$ is chosen as $c_0 = \sum_{j \geq 1} 2^{-\log^*(j)} \approx 2.865064$ such
that the Kraft inequality is satisfied.

To identify which nodes in G are seed nodes, we use the very efficient class of
data-to-model codes [VV04]. A data-to-model code is essentially an index into a canonically
ordered enumeration of all possible data (values) given the model (the provided
information). Here, we know $|S|$ unique nodes have to be selected out of N, for which there
are $\binom{N}{|S|}$ possibilities. Assuming a canonical order, $\log \binom{N}{|S|}$ gives us the length in bits
of an index to the correct set of node ids.

Combining the above, we now have $\mathcal{L}(S)$ for the number of bits to identify a seed set
$S \subseteq V_I$ as

$$\mathcal{L}(S) = \mathcal{L}_{\mathbb{N}}(|S|) + \log \binom{N}{|S|} . \qquad (9.1)$$
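
As a worked illustration of the model cost, the universal integer code and the data-to-model
code can be evaluated numerically. This is a minimal sketch: the function names
`universal_int_bits` and `seed_set_bits` are hypothetical, and the constant is the rounded
value 2.865064 quoted above.

```python
import math

C0 = 2.865064  # normalizing constant of the universal integer code (rounded)


def universal_int_bits(n):
    """Bits of the universal integer code: log*(n) + log(c0), logs to base 2."""
    if n < 1:
        raise ValueError("n must be >= 1")
    bits = math.log2(C0)
    term = math.log2(n)
    while term > 0:                 # log* keeps only the positive terms
        bits += term
        term = math.log2(term)
    return bits


def seed_set_bits(num_nodes, num_seeds):
    """Model cost of a seed set: bits for its size plus bits for the index
    of the chosen subset among all subsets of that size."""
    return universal_int_bits(num_seeds) + math.log2(math.comb(num_nodes, num_seeds))
```

Under this cost, a seed set with more nodes always needs more bits to encode its size, and
also more bits for the subset index as long as the seed set contains fewer than half of the
nodes, which is how the model cost rewards simple explanations.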

9.3.2  Cost  of  the  Data  given  the  Model 

Next, we need to describe the infected subgraph $G_I$ given a seed set $S$. We do this by
encoding the infection propagation ripple, or the description of 'what happened'. Starting
from the seed nodes, per time step we identify that set of nodes that gets infected at this
time step, iterating until we have identified all the infected nodes.¹

Propagation ripples: More formally, a propagation ripple R is a list of node ids per
time-step t, which represents the order in which nodes of $G_I$ became infected, starting
from $S$ at time t = 0. Let us write $V_I^t(S, R)$ to indicate the set of infected nodes at time t
starting from seed set $S$ and following ripple R, with $V_I^0 = S$. For readability, we do not
write $S$ and R wherever clear from context. As such, a valid propagation ripple R is a
partitioning of node ids $V_I \setminus S$ of $G_I$, where every node in a part is required to have an
edge from a node $j \in V_I^{t-1}$.
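
The validity condition on ripples can be checked mechanically. The following sketch is an
illustration only: the function name `is_valid_ripple` and the list-of-sets representation
(one set of newly infected nodes per time-step) are assumptions made for this example.

```python
def is_valid_ripple(adj, seeds, ripple):
    """Check that every newly infected node has an already-infected neighbor.

    adj: dict node -> iterable of neighbors; seeds: the seed set;
    ripple: list of sets, where ripple[t - 1] holds the nodes infected at time t.
    """
    infected = set(seeds)
    for part in ripple:
        for node in part:
            # A node may be infected only once, and only via an infected neighbor.
            if node in infected or not any(nb in infected for nb in adj[node]):
                return False
        infected |= set(part)
    return True
```

Note that a part may be empty, which corresponds to a time-step in which no attack succeeded.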

Clearly,  however,  not  every  ripple  from  the  seed  set  to  the  final  infected  subgraph  is 
equally  simple  to  describe.  For  instance,  the  more  infected  neighbors  an  uninfected  node 
has,  the  more  likely  it  is  that  it  will  get  infected,  as  it  is  under  constant  attack — therefore, 
it  should  be  more  succinct  to  describe  that  this  node  gets  infected  than  it  would  for  a 
node  under  single  attack. 

Frontier sets: To encode a ripple R, at each time t we consider the collection of
nodes currently under attack given the SI model (i.e., non-infected nodes with currently
at least one infected neighbor, or if t = 0, neighbors of a seed node in $S$). We refer to
this set as $\mathcal{F}^t$, the frontier set at time t. Define the attack degree $a(n)$ of a non-infected
node n as the number of infected neighbor nodes it has at the current iteration, i.e.,
$a(n) = |\{ j \in V \mid e_{jn} \in E \wedge X_j(t) \}|$, in which $X_j(t)$ is an indicator function for whether
node j is infected at time t.

¹ When not interested in the actual ripple R, one could encode $G_I$ by its overall probability
starting from $S$. Obtaining this probability, however, is very expensive, even by MCMC sampling.
As we will see in Sections 9.4 and 9.5, computing a good ripple is both cheap and gives good
results.

We divide $\mathcal{F}^t$ into disjoint subsets $\mathcal{F}_i^t$ per attack degree i, that is, into sets of nodes
having the same attack degree. As such, we have $\mathcal{F}^t = \mathcal{F}_1^t \cup \mathcal{F}_2^t \cup \ldots$, and
correspondingly $f_1^t, f_2^t, \ldots$ for the sizes of these subsets (we will drop the t superscript
when clear from context).
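
The grouping of the frontier by attack degree can be sketched as follows. The function name
`frontier_by_attack_degree` and the adjacency-dict input are hypothetical, and the sketch
assumes that adjacency lists contain no duplicate neighbors.

```python
from collections import defaultdict


def frontier_by_attack_degree(adj, infected):
    """Group the frontier nodes by their number of infected neighbors.

    adj: dict node -> iterable of neighbors; infected: set of infected nodes.
    Returns a dict d -> set of un-infected nodes with exactly d infected neighbors.
    """
    attack = defaultdict(int)
    for i in infected:
        for j in adj[i]:
            if j not in infected:
                attack[j] += 1         # one more infected neighbor attacking j
    groups = defaultdict(set)
    for node, d in attack.items():
        groups[d].add(node)
    return dict(groups)
```

The sizes of the returned sets are the per-degree subset sizes used when encoding one
time-step of the ripple.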

Starting from the seed set, for every time step t the receiver can easily construct
the corresponding frontier set $\mathcal{F}^t$, which leaves us to transmit which of the nodes, if
any, in the frontier set got infected in the current iteration. As, however, the infection
probabilities per attack degree differ, we transmit this information per $\mathcal{F}_i^t$.

Probability of Infection: The SI model assumes an attack probability parameter $\beta$, so
the independent probability $p_d$ of a node in $\mathcal{F}_d$ being infected is $p_d = 1 - (1 - \beta)^d$.
Given $p_d$ we can write down the probability distribution of a total of $m_d$ nodes being
infected for each subset $\mathcal{F}_d$. This is simply a Binomial with parameter $p_d$, i.e.,

$$p(m_d \mid f_d, d) = \binom{f_d}{m_d} p_d^{m_d} (1 - p_d)^{f_d - m_d} .$$

Hence, a value for $\beta$ determines p.

Encoding a Wave of Attack: Given $p$, a probability distribution for seeing $m_d$ nodes
out of $f_d$ infected given an attack degree $d$, we need $-\log p(m_d \mid f_d, d)$ bits to
optimally transmit the value of $m_d$. That is, we encode $m_d$ using an optimal prefix
code, for which we can calculate the optimal code lengths by Shannon entropy [CT06]. Then,
once we know both $f_d$ and $m_d$, we can use code words of resp. $-\log \frac{m_d}{f_d}$
and $-\log\left(1 - \frac{m_d}{f_d}\right)$ bits long to transmit whether a node in
$\mathcal{F}_d$ got infected or not. This gives us

$$\mathcal{L}(\mathcal{F}^t) = -\sum_{\mathcal{F}_d \in \mathcal{F}^t} \left( \log p(m_d \mid f_d, d) + m_d \log \frac{m_d}{f_d} + (f_d - m_d) \log\left(1 - \frac{m_d}{f_d}\right) \right) \tag{9.2}$$

for encoding the infectees in the frontier set at time $t$.
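As a worked illustration of Equation 9.2, the sketch below computes the cost in bits of one frontier set from its per-attack-degree counts. The function names and the `(d, f_d, m_d)` input format are assumptions made for this example; it also assumes $0 < \beta < 1$ and uses base-2 logarithms with the usual convention $0 \log 0 = 0$.

```python
from math import comb, log2


def infection_probability(d, beta):
    """p_d = 1 - (1 - beta)^d for a node attacked by d infected neighbors."""
    return 1.0 - (1.0 - beta) ** d


def frontier_cost_bits(groups, beta):
    """Cost L(F^t) in bits, following Equation 9.2.

    groups: iterable of (d, f_d, m_d) triples, one per non-empty attack-degree set,
            where f_d is the set size and m_d the number of nodes infected in it.
    """
    total = 0.0
    for d, f_d, m_d in groups:
        p_d = infection_probability(d, beta)
        prob = comb(f_d, m_d) * p_d ** m_d * (1.0 - p_d) ** (f_d - m_d)
        total -= log2(prob)  # bits to transmit the count m_d
        if 0 < m_d < f_d:    # otherwise both identity terms vanish (0 log 0 = 0)
            total -= m_d * log2(m_d / f_d)
            total -= (f_d - m_d) * log2(1.0 - m_d / f_d)
    return total


# Example: 4 nodes under single attack, 2 of them infected, beta = 0.5.
cost = frontier_cost_bits([(1, 4, 2)], beta=0.5)
```

In this example the count costs $-\log_2 0.375 \approx 1.415$ bits and identifying the two infectees among four nodes costs 4 bits.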

Then, for the recipient to know when to stop reading, we have to transmit how many
time steps are needed until we have reached $G_I$. The number of iterations $T$ will be
transmitted just like $|S|$, using $\mathcal{L}_\mathbb{N}$. For the ripple $R$, starting
from the frontier set defined by the seed nodes $S$, we iteratively transmit which nodes got
infected at $t + 1$, which in turn allows the recipient to construct $\mathcal{F}^{t+1}$.
Note that this encoding assumes ripple $R$ to be on a time scale of $1/\beta$. That is, for
low $\beta$ we consider a lower time resolution than for high $\beta$. This is because the
SI model displays a natural invariance of time-scale (see Section 9.2.2). So we have a
ripple $R$ that gives the infections at every $1/\beta$ time-steps.

With the above, we have $\mathcal{L}(R \mid S)$ for the encoded length of a ripple $R$
starting at a seed set $S$ as

$$\mathcal{L}(R \mid S) = \mathcal{L}_\mathbb{N}(T) + \sum_{t}^{T} \mathcal{L}(\mathcal{F}^t) \tag{9.3}$$
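Equation 9.3 can be sketched in a few lines once the per-step costs are available. The code below assumes that $\mathcal{L}_\mathbb{N}$ is the standard universal code for positive integers, $\mathcal{L}_\mathbb{N}(n) = \log^*(n) + \log c_0$ with $c_0 \approx 2.865064$; that definition is given outside this section, so treat it as an assumption here. The function names are hypothetical.

```python
from math import log2


def universal_integer_code_bits(n):
    """Code length in bits of the universal code for an integer n >= 1 (assumed L_N)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    bits = log2(2.865064)  # constant c0 that makes the code lengths satisfy Kraft
    term = log2(n)
    while term > 0:        # log*(n): sum the positive terms of log n + log log n + ...
        bits += term
        term = log2(term)
    return bits


def ripple_cost_bits(frontier_costs):
    """L(R | S): cost of the number of steps T plus the cost of every frontier set.

    frontier_costs: list with one entry L(F^t) (in bits) per time step of the ripple.
    """
    T = len(frontier_costs)
    return universal_integer_code_bits(T) + sum(frontier_costs)


# Example: a three-step ripple whose frontier sets cost 5.4, 6.0 and 3.2 bits.
total_bits = ripple_cost_bits([5.4, 6.0, 3.2])
```

Because the first term grows only slowly in $T$, the sum over frontier sets dominates for all but the shortest ripples.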


9.3.3  The  Problem 

With $\mathcal{L}(S)$ and $\mathcal{L}(R \mid S)$, the total description length
$\mathcal{L}(G_I, S, R)$ of an infected subgraph $G_I$ of $G$ following a valid infection
propagation ripple $R$ starting from a set of seed nodes $S$ is

$$\mathcal{L}(G_I, S, R) = \mathcal{L}(S) + \mathcal{L}(R \mid S) .$$

Note that as $G$ is constant over all seed sets $S$ and ripples $R$, we can safely ignore it
in the computation of the total encoded size: its encoded length would be a constant term
and hence would not influence the selection of the best model.

With this, we can now formally state our problem.

Minimal Infection Description Problem: Given a snapshot of a graph $G(V, \mathcal{E})$ of
$N$ nodes, of which the subgraph $G_I(V_I, \mathcal{E}_I)$ of $N_I$ nodes is infected, and an
infection probability $\beta$, by the Minimum Description Length principle we are after that
seed set $S$ and that valid propagation ripple $R$ for which

$$\mathcal{L}(G_I, S, R)$$

is minimal for the Susceptible-Infected propagation strategy.

Clearly, this problem entails a large search space, both in the possible seed-subsets of
$V_I$ and the possible propagation ripples given any seed-set. In fact, as shown by Shah
and  Zaman  [SZ11],  even  the  problem  of  just  finding  one  MLE  seed  for  a  given  infected 
snapshot  in  an  arbitrary  graph  is  very  hard  (#P-Complete,  equivalent  to  counting  the 
number  of  linear  extensions  of  a  poset).  Further,  the  provable  algorithms  they  give  are 
for  one  seed  on  d-regular  trees  only.  To  tackle  the  problem  on  general  graphs  we  hence 
resort  to  heuristics. 


9.4  Proposed  Method 

The  outline  of  our  approach  is  as  follows:  given  a  fixed  number  of  seeds  k,  we  identify 
a  high-quality  k-seed  set.  Given  these  seed  nodes,  we  optimize  the  propagation  ripple. 
With  these  two  combined,  we  can  use  our  MDL  score  to  identify  the  best  k. 

9.4.1  Best  seed-set  given  number  of  seeds  —  'Exoneration' 

A central idea is that, intuitively, un-infected nodes should provide some degree of
'exoneration' from 'blame' for the neighboring infected nodes. Figure 9.2 shows two
illustrative examples: an infected chain (a) and a chain with a star in the middle (b)
(colored nodes are infected and blue nodes denote the true seeds). Note that while the
node X is the most central among the infected nodes and is rightly the most likely seed,
the node Y is not a likely seed because of the many un-infected nodes surrounding it. In
fact, in this case the most likely starting points would be the two blue nodes. Hence any
method to identify the seed-sets should take into account the centrality of the infected
nodes within the infected graph, but also penalize nodes for being too close to the
un-infected frontier (the 'exoneration'). As we explain next, our method is able to do this
in a principled manner.


(a) A chain

Figure 9.2: Centrality is not enough: effects of 'exoneration'. Infection snapshot examples
(colored nodes are infected, blue nodes are the true seeds). (a) Node X is the most central
among the infected nodes; (b) Node Y is the most central among infected nodes, but the high
count of non-infected neighbors 'exonerates' it.


9.4.2  Finding  best  single  seed — Our  Main  Idea 

We  first  explain  how  to  find  the  best  single  seed  and  then  how  to  extend  it  to  multiple 
seeds.  Jumping  ahead,  the  main  idea  is  as  follows. 

Main Idea: The single best seed $s^*$ is the one with the highest score in $u_1$, i.e.

$$s^* = \arg\max_s u_1(s)$$

where $u_1$ is the eigenvector corresponding to the smallest eigenvalue of the Laplacian
submatrix $L_A$ as defined in Table 9.2. Next, we give the justification.

9.4.3  Finding  the  best  single  seed — Justification 

From Section 9.3, it is clear that nodes that are in neither the final frontier set
$\mathcal{F}$ nor $V_I$ play no role, as they were not infectious nor could have been
infected. Hence, WLOG, assume $G$ contains only the infected subgraph $G_I$ and the frontier
set $\mathcal{F}$. Also, assume nodes are numbered in such a way that the first $|V - V_I|$
nodes are the un-infected nodes and the rest are the infected ones. If the total number of
nodes in the graph is $N$ and the number of infected nodes is $N_I$, then the number of
un-infected nodes in $G$ is $N - N_I$. Further notation is given in Table 9.2.

Let $X_i(t)$ be the indicator (0/1) random variable denoting if node $i$ in the graph is
infected or not at time $t$ (1 = infected, 0 = un-infected). Let $Y_{ij}(t)$ be the indicator
random variable denoting if node $j$ successfully attacks $i$ at time $t$. Consider the
following update equation for any node $i \in V_I$:

$$X_i(t+1) = X_i(t) + (1 - X_i(t)) \times \bigvee_{j \in \mathcal{N}(i)} Y_{ij}(t)\,\bigl(X_j(t) - X_i(t) + X_i(t)\bigr) \tag{9.4}$$

Following the above equation, if $X_i(t) = 1$ then $X_i(t+1) = 1$, i.e., once a node is
infected, it stays infected. Also, if $X_i(t) = 0$, then
$X_i(t+1) = \bigvee_{j \in \mathcal{N}(i)} Y_{ij}(t) X_j(t)$. In other words, an uninfected
node may get infected only if an infected neighbor successfully transmits the infection.
Additionally, for any node $i \in V - V_I$, we define $X_i(t) = 0$, as these nodes were not
infected at all during the infection process. Hence, the above equations exactly define a
discrete-time SI process but with the constraint that the nodes in the given final frontier
set always stay un-infected, thus enforcing the 'exoneration' discussed before. Hence we
want to find the seed node which maximizes spread in this 'constrained' epidemic, which we
show how to do next.
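The 'constrained' process of Equation 9.4 can be simulated directly. The sketch below performs one synchronous step; for simplicity it assumes an unweighted graph in which every edge transmits with the same probability $\beta$, and the names `constrained_si_step` and `blocked` (the final frontier set, which must stay un-infected) are hypothetical.

```python
import random


def constrained_si_step(adj, infected, blocked, beta, rng):
    """One synchronous step of the constrained SI process (Equation 9.4).

    adj: dict node -> iterable of neighbors.
    infected: set of nodes with X_i(t) = 1.
    blocked: nodes of the final frontier set; their X_i stays 0 forever.
    beta: attack probability per infected neighbor (assumed uniform here).
    rng: a random.Random instance, so that runs are reproducible.
    """
    newly_infected = set()
    for node, neighbors in adj.items():
        if node in infected or node in blocked:
            continue
        # Logical OR over the independent attacks of all infected neighbors.
        if any(nb in infected and rng.random() < beta for nb in neighbors):
            newly_infected.add(node)
    return infected | newly_infected  # infected nodes stay infected


# Path 0-1-2-3 with node 3 'exonerated'; beta = 1 makes the example deterministic.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
rng = random.Random(0)
state = {1}
for _ in range(3):
    state = constrained_si_step(adj, state, blocked={3}, beta=1.0, rng=rng)
```

Even with certain transmission the infection never reaches node 3, which is the behavior the constraint is meant to enforce.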

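For any node $i \in V_I$, taking expectations on both sides of Equation 9.4, and using the
fact that for any indicator random variable $X$, $E[X] = \Pr(X = 1)$, we get:

$$P_i(t+1) = P_i(t) + U - V \tag{9.5}$$

where

$$P_i(t) = \Pr(X_i(t) = 1)$$

$$U = E\left[\, \bigvee_{j \in \mathcal{N}(i)} Y_{ij}(t)\,\bigl(X_j(t) - X_i(t) + X_i(t)\bigr) \right]$$

$$V = E\left[\, X_i(t) \times \bigvee_{j \in \mathcal{N}(i)} Y_{ij}(t)\,\bigl(X_j(t) - X_i(t) + X_i(t)\bigr) \right]$$

Clearly, as all the terms inside are non-negative,

$$V \geq 0, \quad U \geq 0 .$$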

Also,

$$U \leq \sum_{j \in \mathcal{N}(i)} A(G)_{ij}\,\bigl(P_j(t) - P_i(t) + P_i(t)\bigr) = \sum_{j \in \mathcal{N}(i)} A(G)_{ij} P_i(t) + \sum_{j \in \mathcal{N}(i)} A(G)_{ij}\,\bigl(P_j(t) - P_i(t)\bigr) \tag{9.6}$$

as an infected node $j$ attacks any of its neighbors $i$ independently with probability
$A(G)_{ij}$ (i.e. $E[Y_{ij}(t)] = A(G)_{ij}$) and because, by linearity of expectation, for
any two indicator random variables $1_a$ and $1_b$, we have
$1_a \vee 1_b = 1_a + 1_b - 1_a 1_b \Rightarrow E[1_a \vee 1_b] \leq E[1_a] + E[1_b]$.
Also note that:

$$\sum_{j \in \mathcal{N}(i)} A(G)_{ij} P_i(t) \leq d_{max} \times P_i(t) \tag{9.7}$$

where $d_{max}$ is the largest degree in graph $G$. Thus,

$$U \leq d_{max} P_i(t) + \sum_{j \in \mathcal{N}(i)} A(G)_{ij}\,\bigl(P_j(t) - P_i(t)\bigr) \tag{9.8}$$

From Equations 9.5 and 9.8, and since $V \geq 0$, we can conclude that, for each node
$i \in V_I$,

$$P_i(t+1) \leq P_i(t) + d_{max} P_i(t) + \sum_{j \in \mathcal{N}(i)} A(G)_{ij}\,\bigl(P_j(t) - P_i(t)\bigr)$$

Let $\alpha = 1 + d_{max}$. Recall that $\forall t$, $P_i(t) = 0$ for any eventual
un-infected node $i \in V - V_I$. Let $P(t) = [P_1(t), P_2(t), \ldots, P_N(t)]^T$ (over all
the nodes in $V$). Then we can write:

$$P(t+1) \leq \alpha \left(I - \frac{1}{\alpha} M\right) P(t) \tag{9.9}$$

where the matrix $M$ (size $N \times N$) is:

$$M = \begin{bmatrix} 0_{N-N_I,\,N-N_I} & 0_{N-N_I,\,N_I} \\ 0_{N_I,\,N-N_I} & L_A \end{bmatrix}$$

where we write $0_{n,m}$ for an all-zeros matrix of size $n \times m$. Let the subvector of
$P(t+1)$ corresponding to the infected nodes be written as $P_I(t+1)$. Then continuing from
above and using the upper bound as an approximation, we get:

$$P_I(t+1) \approx \alpha \left(I - \frac{1}{\alpha} L_A\right) P_I(t) \tag{9.10}$$

$$= \alpha \left(I - \frac{1}{\alpha} L_A\right)^t P_I(0) \tag{9.11}$$

$$= \alpha \sum_i \lambda_i^t\, u_i u_i^T P_I(0) \tag{9.12}$$

where $\lambda_i$ and $u_i$ are the eigenvalues and eigenvectors of the matrix
$I - \frac{1}{\alpha} L_A$. We have the following two lemmas:

Lemma 9.1. The largest eigenvalue $\lambda_1$ and eigenvector $u_1$ of the matrix
$I - \frac{1}{\alpha} L_A$ are all positive and real.

Proof. (Details omitted for brevity) The matrix $I - \frac{1}{\alpha} L_A$ is non-negative,
and imagining $I - \frac{1}{\alpha} L_A$ as an adjacency matrix, the corresponding graph is
irreducible, as graph $G_I$ (adjacency matrix $A$) is connected. We then get the lemma due
to the Perron-Frobenius theorem [McC00]. □

Lemma 9.2. The largest eigenvalue of matrix $I - \frac{1}{\alpha} L_A$ and the smallest
eigenvalue of $L_A$ are related as
$\lambda_1(I - \frac{1}{\alpha} L_A) = 1 - \frac{1}{\alpha} \lambda_N(L_A)$.

Proof. (Details omitted for brevity) It is easy to see that any eigenvalue
$\mathrm{eig}(I - \frac{1}{\alpha} L_A) = 1 - \mathrm{eig}(\frac{1}{\alpha} L_A)$. As the
matrices are symmetric, all the eigenvalues involved are real. By the Cauchy eigenvalue
interlacing theorem [Str88] applied to $L(G)$, all the eigenvalues of any co-factor $C_{LG}$
of $L(G)$ are non-negative. By the famous Kirchhoff's matrix theorem [CDS98], the
determinant of any co-factor $C_{LG}$ is also non-zero as it counts the number of spanning
trees of $G$. Also, it is well-known that the determinant of any matrix is just the product
of its eigenvalues [Str88]. Hence, all eigenvalues of any co-factor matrix $C_{LG}$ of
$L(G)$ are strictly positive. We can similarly apply eigenvalue interlacing successively to
a suitable $C_{LG}$ and so on till we get to $L_A$ (a principal submatrix of $L(G)$), and
get that all eigenvalues of $L_A$ are strictly positive. The lemma then follows. □

Hence, the eigenvector $u_1$ is also the eigenvector corresponding to the smallest
eigenvalue of $L_A$.
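Both lemmas are easy to check numerically on a small instance. The sketch below is only a sanity check under assumed inputs: a hypothetical path graph 0-1-2-3-4 in which nodes 1, 2, 3 are infected and the two endpoints form the un-infected frontier, with a dense eigen-solver from NumPy standing in for whatever solver one would use at scale.

```python
import numpy as np

# Hypothetical example graph: path 0-1-2-3-4.
A = np.zeros((5, 5))
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4)]:
    A[i, j] = A[j, i] = 1.0

L = np.diag(A.sum(axis=1)) - A          # Laplacian L(G) = D(G) - A(G)
infected = [1, 2, 3]
L_A = L[np.ix_(infected, infected)]     # Laplacian submatrix of the infected nodes
alpha = 1.0 + A.sum(axis=1).max()       # alpha = 1 + d_max
B = np.eye(len(infected)) - L_A / alpha

eig_L, _ = np.linalg.eigh(L_A)          # eigenvalues in ascending order
eig_B, vec_B = np.linalg.eigh(B)

smallest_L = eig_L[0]
largest_B = eig_B[-1]
u1 = vec_B[:, -1]
u1 = u1 if u1.sum() > 0 else -u1        # eigenvectors are defined up to sign
```

On this instance `largest_B` equals `1 - smallest_L / alpha` up to rounding, `smallest_L` is strictly positive, and all entries of `u1` are positive, in line with Lemmas 9.1 and 9.2.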

Now, from Equation 9.12 and Lemma 9.1, we have:

$$P_I(t+1) \approx \alpha \lambda_1^t\, u_1 u_1^T P_I(0) \tag{9.13}$$

assuming a substantial eigen-gap or 'big-enough' $t$. Now assuming that $P_I(0)$ is all
zero except for a single seed $s$ for which it is 1, we can conclude that ultimately in our
'constrained' epidemic,

$$\forall i \in V_I, \quad \Pr(X_i = 1 \mid s) \propto u_1(i)\, u_1(s) \tag{9.15}$$

$$\forall i \in V - V_I, \quad \Pr(X_i = 1 \mid s) = 0 \tag{9.16}$$

Clearly the most likely single seed $s^*$ would be:

$$s^* = \arg\max_s \left[ \sum_{i \in V_I} \Pr(X_i = 1 \mid s) + \sum_{i \in V - V_I} \bigl(1 - \Pr(X_i = 1 \mid s)\bigr) \right]$$

Using Equations 9.15 and 9.16,

$$s^* \approx \arg\max_s\, u_1(s) \sum_{i \in V_I} u_1(i) = \arg\max_s\, u_1(s) \tag{9.17}$$

Hence, for a single seed, we just need to find the node with the largest score in $u_1$
(which, from Lemma 9.2, is also the eigenvector corresponding to the smallest eigenvalue of
the Laplacian submatrix $L_A$).
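The single-seed rule of Equation 9.17 can be sketched as follows. This is a minimal illustration rather than our implementation: it uses a dense eigen-decomposition from NumPy instead of a sparse Lanczos solver, assumes the infected subgraph is connected, and the function name `best_single_seed` and the example graph are hypothetical.

```python
import numpy as np


def best_single_seed(A, infected):
    """Return the infected node with the largest entry in u_1, together with u_1.

    A: symmetric adjacency matrix of G (infected subgraph plus frontier).
    infected: indices of the infected nodes.
    """
    infected = list(infected)
    L = np.diag(A.sum(axis=1)) - A            # degrees count un-infected neighbors too
    L_A = L[np.ix_(infected, infected)]
    _, vecs = np.linalg.eigh(L_A)             # ascending eigenvalues
    u1 = vecs[:, 0]                           # eigenvector of the smallest eigenvalue
    if u1.sum() < 0:                          # fix the arbitrary sign
        u1 = -u1
    return infected[int(np.argmax(u1))], u1


# Path 0-1-...-6 where nodes 1..5 are infected and the endpoints are un-infected.
A = np.zeros((7, 7))
for i in range(6):
    A[i, i + 1] = A[i + 1, i] = 1.0
seed, scores = best_single_seed(A, infected=range(1, 6))
```

On this chain the scores are symmetric and peak at the middle node 3, matching the intuition of Figure 9.2(a) that the most central infected node is the most likely single seed.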

9.4.4  Finding  best  k-seed  set 

Note that simply taking the top-$k$ in the above eigenvector will not give good
$k$-seed-sets due to lack of diversity. This is because the error in the upper-bound
approximation used in Equation 9.11 will become larger due to the increase in the norm of
$P_I(0)$. Hence, we treat
the  newly  chosen  seed,  say  s*,  as  un-infected,  effectively  exonerating  its  neighbors  and 
boosting  diversity.  We  redo  our  computation  on  the  resulting  smaller  infected  graph,  but 
a  potentially  larger  frontier  set — hence,  we  take  the  next  best  seed  given  the  s*  that  has 
already  been  chosen.  So  for  any  given  k,  we  successively  find  the  best  next  seed,  given 
the  previous  choices,  by  removing  the  previously  chosen  seeds  from  the  infected  set  and 
solving  Equation  9.17.  For  example,  in  Figure  9.1,  the  top  suspect  (Red  on  the  right)  will 
have  a  lot  of  suspicious  neighbors  as  well.  Thus,  using  our  exoneration  technique,  the 
algorithm  will  be  forced  away  from  them  towards  the  remaining  Red  seed. 
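The successive selection with exoneration described in this subsection can be sketched as below for a fixed $k$ (the MDL-based choice of $k$ is left out). As before this is an illustrative sketch using a dense NumPy eigen-solver; `top_k_seeds` is a hypothetical name, and the example is a chain-star graph built to resemble the CHAIN-STAR setting, with the 7 chain nodes infected and the 100 star leaves un-infected.

```python
import numpy as np


def top_k_seeds(A, infected, k):
    """Pick k seeds one at a time, treating each chosen seed as un-infected afterwards.

    A: symmetric adjacency matrix of G; infected: indices of infected nodes;
    k must not exceed the number of infected nodes.
    """
    L = np.diag(A.sum(axis=1)) - A
    remaining = sorted(infected)
    seeds = []
    for _ in range(k):
        L_A = L[np.ix_(remaining, remaining)]
        _, vecs = np.linalg.eigh(L_A)
        u1 = np.abs(vecs[:, 0])               # smallest-eigenvalue eigenvector, sign-free
        seed = remaining[int(np.argmax(u1))]
        seeds.append(seed)
        remaining.remove(seed)                # exonerates the neighbors of the seed
    return seeds


# Chain 0..6 whose middle node 3 has 100 additional (un-infected) neighbors.
n_chain, n_leaves = 7, 100
N = n_chain + n_leaves
A = np.zeros((N, N))
for i in range(n_chain - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0
for leaf in range(n_chain, N):
    A[3, leaf] = A[leaf, 3] = 1.0

seeds = top_k_seeds(A, infected=range(n_chain), k=2)
```

In this example the middle node is heavily penalized by its un-infected neighbors, and the two selected seeds are the two ends of the chain; after the first end is chosen and removed, its side of the chain is exonerated and the second pick moves to the other side.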

9.4.5  Finding  a  good  ripple 

As discussed before, once we find the best seed-set $S_k$ for a given $k$, we optimize the
propagation ripple of $S_k$ to $G_I$ to minimize the total encoded size. Recall from
Section 9.3 that this involves minimizing $\mathcal{L}(R \mid S)$, which consists of two
terms. First, we have the cost of encoding the length of the ripple, the number of
time-steps. While $\mathcal{L}_\mathbb{N}$ does grow for higher values of $T$, in practice
this term will be dwarfed by the actual encoding of the subsequent frontier sets. As such,
minimizing $\mathcal{L}(R \mid S)$ essentially comes down to minimizing
$\sum_t \mathcal{L}(\mathcal{F}^t)$, or, in other words, maximizing the likelihood of the
ripple $R$. Further recall that the SI model has a natural scaling invariance, $1/\beta$.
As our score takes this into account, the ripple with the smallest description length
should too.

Hence, we design the following procedure. For each attack-degree set $\mathcal{F}_d$, at
any iteration we scale the number of attacks by $1/\beta$, i.e. a set of size $f_d$ is
equivalent to a set of size $f_d/\beta$. Then, to get the overall MLE ripple, we adopt the
following heuristic. We assume that the overall MLE ripple always performs a locally
optimal next step. Hence this boils down to choosing the most-likely nodes to get infected
at any given step, for a given frontier set $\mathcal{F}$.

It is well-known that a Binomial distribution $B(n, p)$ has its mode at
$\lfloor (n+1)p \rfloor$. Using this fact, at any iteration $t$, taking into account the
scaling, we can see that the most likely number of nodes infected in an attack-degree set
$\mathcal{F}_d$ would be $n_d = \lfloor (f_d/\beta + 1) \times p_d \rfloor$, where $p_d$, as
defined before in Section 9.3, is the attack probability in the set $\mathcal{F}_d$. As
such, we can simply uniformly choose this number of nodes from $\mathcal{F}_d$, as each node
in $\mathcal{F}_d$ is equally likely to be infected. We do this for every non-empty
attack-degree set, for every iteration, until we have infected exactly the observed
snapshot. This way, we obtain a most likely propagation ripple for any given seed-set $S_k$
and can subsequently score it using MDL.

Algorithm 6 NETSLEUTH

Input: $G(V, \mathcal{E}) = G_I \cup \mathcal{F}$, $G_I(V_I, \mathcal{E}_I)$ (the infected
graph) and $\mathcal{F}$ (the frontier set).
Output: $S$ = the set of seeds (culprits).

    $L(G) = D(G) - A(G)$, the Laplacian matrix corresponding to graph $G$.
    $S = \{\}$
    $G_I' = G_I$
    while $\mathcal{L}(G_I, S, R)$ decreases do
        $L_A$ = the submatrix of $L(G)$ corresponding to $G_I'$.
        $v$ = eigenvector of $L_A$ corresponding to the smallest eigenvalue.
        next = $\arg\max_i v(i)$
        $S = S \cup \{\text{next}\}$
        $R$ = ripple maximizing likelihood of $G_I$ from $S$
        $G_I' = G_I' \setminus \{\text{next}\}$ (graph $G_I'$ with node next removed)
    end while
    return $S$
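The ripple-construction heuristic of Section 9.4.5 (infect the modal number of nodes in every attack-degree set, step by step, until the snapshot is reproduced) can be sketched as follows. Two points are assumptions of this sketch rather than statements about our implementation: only nodes of the observed snapshot are considered, and the count $n_d$ is capped at the set size $f_d$. All names are hypothetical.

```python
import random
from collections import defaultdict
from math import floor


def most_likely_infection_count(f_d, d, beta):
    """n_d = floor((f_d / beta + 1) * p_d), capped at f_d (the cap is an assumption)."""
    p_d = 1.0 - (1.0 - beta) ** d
    return min(f_d, floor((f_d / beta + 1.0) * p_d))


def build_ripple(adj, snapshot, seeds, beta, rng):
    """Greedy ripple: a list of sets, one set of newly infected nodes per scaled step."""
    infected = set(seeds)
    ripple = []
    while infected != snapshot:
        groups = defaultdict(list)            # attack degree -> frontier nodes in snapshot
        for node in sorted(snapshot - infected):
            degree = sum(1 for nb in adj[node] if nb in infected)
            if degree > 0:
                groups[degree].append(node)
        if not groups:
            raise ValueError("snapshot cannot be reached from the given seeds")
        step = set()
        for d, nodes in groups.items():
            count = most_likely_infection_count(len(nodes), d, beta)
            step.update(rng.sample(nodes, count))   # uniform choice within the set
        ripple.append(step)
        infected |= step
    return ripple


# Path 0-1-2-3-4; nodes 0..3 are infected in the snapshot, node 4 is not.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
ripple = build_ripple(adj, snapshot={0, 1, 2, 3}, seeds={1}, beta=0.3,
                      rng=random.Random(0))
```

Under this reading the count reaches the cap whenever $0 < \beta < 1$ (since $p_d \geq \beta$), so each scaled step infects the whole attack-degree set; this is a property of the sketch's interpretation of the formula and should not be read as a description of the original code.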

Finally,  we  stop  getting  more  seeds  when  the  MDL  score  for  Sk  increases  as  we 
increase  k.  Algorithm  6  gives  the  pseudo-code  and  Lemma  9.3  shows  the  running  time 
for  our  algorithm  NETSLEUTH. 

Lemma 9.3 (Running Time of NETSLEUTH). The time complexity of NETSLEUTH is
$O(k^*(\mathcal{E}_I + \mathcal{E}_F + V_I))$.


Proof. We keep finding $S_k$ for each seed-set size until MDL tells us to stop. Hence the
running time is $O(k^*(\mathcal{E}_I + T_{RIPPLE} + T_{MDL}))$, if $k^*$ is the optimal
seed-set size, $T_{RIPPLE}$ is the running time of computing the ripple, and $T_{MDL}$ is
the running time of computing the MDL score given the seed set size is $k^*$. Here we used
the fact that calculating the eigenvector using the Lanczos method is approximately $O(E)$
(# edges) for sparse graphs.

The worst-case complexity $T_{MDL}$ of calculating $\mathcal{L}(G_I, S, R)$ for a given
$G_I$, $S$, and $R$ is $O(\mathcal{E}_I + \mathcal{E}_F + V_I)$. The $\mathcal{L}(S)$ term
is $O(1)$. For the $\mathcal{L}(R \mid S)$ term, we need to iterate over the ripple, which
is at most $V_I$ steps long. We only have to update the frontier set $\mathcal{F}$ when one
or more nodes got infected, for which we then have to update the attack degrees of the
nodes connected to the nodes infected at time $t$. Hence we traverse every edge in
$\mathcal{E}_I + \mathcal{E}_F$, and every node in $V_I$, which gives it the complexity of
$O(\mathcal{E}_I + \mathcal{E}_F + V_I)$.

Finally, the running time $T_{RIPPLE}$ of computation of the MLE ripple for a given $S_k$
is also $O(\mathcal{E}_I + \mathcal{E}_F + V_I)$.

So the overall complexity of NetSleuth is $O(k^*(\mathcal{E}_I + \mathcal{E}_F + V_I))$. □

Hence NetSleuth is linear in the number of edges and vertices of the infected
sub-graph and the frontier set, which makes our method scalable for large graphs (as
compared to the methods in [SZ11, LTGM10] which, even for detecting a single seed, are
$O(N^2)$).

9.5  Experiments 

Here we experimentally evaluate NetSleuth, in particular its effectiveness in finding
culprits, i.e. whether it correctly identifies (a) how many as well as (b) which ones, and
its (c) scalability.

9.5.1  Experimental  Setup 

We  implemented  NetSleuth  in  Matlab,  and  in  addition  we  implemented  the  SI  model  as 
a  discrete  event  simulation  in  C++.  All  reported  results  are  averaged  over  10  independent 
runs  (so  we  generate  10  graphs  for  each  seed  set). 

In  our  study  we  use  both  synthetic  and  real  networks — we  chose  synthetic  networks 
exemplifying  different  types  of  situations.  We  consider  the  following  networks: 

1.  GRID is a 60x60 2D grid as shown in Figure 9.1.

2.  CHAIN-STAR is a graph of 107 nodes in total. The first 7 nodes form a linear chain
and the middle node has 100 additional neighbors.

3.  AS-OREGON is the Oregon AS router graph, a network graph collected from the Oregon
router views. It contains 15 420 links among 3 995 AS peers.2

For the experiments on AS-OREGON, we ran the experiments for true-seed count
$k^* = 1, 2, 3$. For each seed-set, we run a simulation until at least 30% of the graph is
infected, and give the resulting footprint as input to NETSLEUTH. Note that the larger
the number of infections, the tougher it is to find the true seeds, as in the SI model any
seed will eventually infect the whole graph with certainty. Finally, we make sure that the
infected sub-graph is connected; otherwise, we just have separate problem instances.
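The footprint-generation step of this setup can be sketched as follows. This is a simple discrete-time Python illustration, not the C++ discrete event simulation we used; the function names, the 10x10 grid (smaller than GRID, for brevity) and the parameter values are assumptions, and $\beta$ must be positive for the loop to make progress.

```python
import random


def simulate_si(adj, seeds, beta, min_fraction, rng):
    """Run a discrete-time SI process until at least min_fraction of nodes are infected."""
    infected = set(seeds)
    target = min_fraction * len(adj)
    while len(infected) < target:
        frontier = {nb for node in infected for nb in adj[node]} - infected
        if not frontier:
            break                             # the seeds' components are exhausted
        newly_infected = set()
        for node in sorted(frontier):
            attackers = sum(1 for nb in adj[node] if nb in infected)
            # Every infected neighbor attacks independently with probability beta.
            if any(rng.random() < beta for _ in range(attackers)):
                newly_infected.add(node)
        infected |= newly_infected
    return infected


def is_connected(adj, nodes):
    """Check that the subgraph induced by nodes is connected (depth-first search)."""
    nodes = set(nodes)
    if not nodes:
        return True
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nb in adj[node]:
            if nb in nodes and nb not in seen:
                seen.add(nb)
                stack.append(nb)
    return seen == nodes


def grid_graph(side):
    """Adjacency dict of a side x side 2D grid with 4-neighborhoods."""
    adj = {(r, c): [] for r in range(side) for c in range(side)}
    for r, c in adj:
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            if (r + dr, c + dc) in adj:
                adj[(r, c)].append((r + dr, c + dc))
    return adj


adj = grid_graph(10)
footprint = simulate_si(adj, seeds=[(5, 5)], beta=0.2, min_fraction=0.3,
                        rng=random.Random(42))
usable = is_connected(adj, footprint)  # keep only connected footprints as inputs
```

With a single seed the footprint is connected by construction; with several seeds the connectivity check decides whether the footprint is used as one problem instance.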

As discussed in the introduction and Section 9.6, the existing proposals for identifying
culprits consider significantly different problem settings than we do (see Table 9.1); the
Rumor Centrality of Shah and Zaman [SZ10, SZ11] can only discover one seed node,
while Effectors of Lappas et al. [LTGM10] even considers a completely different infection
model. As such we cannot meaningfully compare performances, and hence here only
consider NETSLEUTH.

2 For more information see http://topology.eecs.umich.edu/data.html.

(e) Scatter plot of Jaccard scores

Figure 9.3: Effectiveness of NetSleuth in answering both How many and Which ones. Results of
our experiments on the AS-OREGON graph for true-seed-count $k^* = 1, 2, 3$ (rows, subfigures
(a-c), (d-f), (g-i) respectively). First column, (a),(d),(g): MDL scores as a function of
$k$ found by NetSleuth are near-convex; also we recover the true number in all cases. Second
column, (b),(e),(h): scatter plots of Jaccard scores ($JD_S(V_I)$) of NetSleuth seeds
(y-axis) and the corresponding true seeds (x-axis). On or below the 45-degree line is
better. Each point is the average of 10 runs. Note that for many runs the seeds identified
by NetSleuth score exactly as well as or even better than the true seeds. Third column,
(c),(f),(i): average $Q_{MDL}$ and $Q_{JD}$ scores for the seeds returned by NetSleuth.
Each bar represents the average over 90 different seed-sets. Note that all the bars are
close to 1, indicating that we consistently find high-quality seed sets both with the
Jaccard measure and with the MDL measure.

Evaluation Function: a subtle issue


How to evaluate the goodness of a seed set? That is, in Figure 9.1, how close are the
red seeds (recovered) to the blue seeds (true)? Notice that the recovered seeds may
actually have a better score than the actual ones, for the same reason that the sample
mean of a group of 1D Gaussian instances gives lower sum-squared-distances than the
theoretical mean of the distribution. Moreover, even for evaluation, it is intractable to
compute the exact probability of observing the footprint from a given seed-set.

Thus we propose two quality measures. The first, $Q_{MDL}$, is based on our MDL: we
report the ratio of the MDL score of our seeds to the MDL score of the actual seeds, i.e.

$$Q_{MDL} = \frac{\mathcal{L}(G_I, S, R)}{\mathcal{L}(G_I, S^*, R^*)} \tag{9.18}$$

Clearly, the closer to 1, the better.

The second, $Q_{JD}$, intuitively measures the overlap of the footprint produced by a
seed-set $S$ and the input footprint $G_I(V_I, \mathcal{E}_I)$. Clearly, the candidate
seed-set $S$ can produce $n$ footprints when we run $n$ simulations; so we compute
$E[JD_S(V_I)]$, the average Jaccard distance3 of all these $n$ footprints w.r.t. the true
input footprint $V_I$. As with $Q_{MDL}$, we normalize it with the corresponding score
$E[JD_{S^*}(V_I)]$ for the true seed-set, and thus report the ratio

$$Q_{JD} = \frac{E[JD_S(V_I)]}{E[JD_{S^*}(V_I)]} \tag{9.19}$$


Again,  the  closer  to  1,  the  better. 
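The Jaccard-based measure of Equation 9.19 is straightforward to compute once simulated footprints are available. The sketch below uses the standard Jaccard distance $1 - |A \cap B| / |A \cup B|$; the function names and the small hand-made footprints are illustrative assumptions, not experimental data.

```python
def jaccard_distance(a, b):
    """Standard Jaccard distance between two sets: 1 - |A & B| / |A | B|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0                            # two empty sets are identical
    return 1.0 - len(a & b) / len(union)


def mean_jaccard_distance(footprints, observed):
    """E[JD_S(V_I)]: average distance of simulated footprints to the observed one."""
    return sum(jaccard_distance(fp, observed) for fp in footprints) / len(footprints)


def quality_ratio(score_found, score_true):
    """Ratio of the score of the recovered seeds to that of the true seeds."""
    return score_found / score_true           # undefined if the true-seed score is 0


# Hypothetical footprints from n = 2 simulations per seed-set.
observed = {1, 2, 3, 4}
from_found_seeds = [{1, 2, 3}, {1, 2, 3, 4, 5}]
from_true_seeds = [{1, 2, 3, 4, 5, 6}, {2, 3, 4}]
q_jd = quality_ratio(mean_jaccard_distance(from_found_seeds, observed),
                     mean_jaccard_distance(from_true_seeds, observed))
```

The same `quality_ratio` helper applies to $Q_{MDL}$ of Equation 9.18 when it is given the two description lengths instead of the two mean distances.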


9.5.2  Effectiveness  of  NetSleuth  in  identifying  How  Many 

In short, NETSLEUTH was able to find the exact number of seeds for all the cases.
Figures 9.3(a),(d),(g) show the MDL score as a function of $k = 1, 2, \ldots, 6$ seeds found
by NetSleuth before stopping, for true seed-sets with $k^* = 1$, $k^* = 2$ and $k^* = 3$
respectively on the AS-OREGON network. Note that the plots show near-convexity, with
the minimum at the true $k^*$, justifying our choice of stopping after $j = 6$ iterations of
increasing scores. It also shows the power of our approach, as we can easily recover the
true number of seeds using a principled approach.


9.5.3  Effectiveness  of  NetSleuth  in  identifying  Which  Ones 

In short, in addition to finding the correct number of seeds, NETSLEUTH is able to identify
good-quality seeds with high accuracy. For our synthetic graphs, NETSLEUTH is able to
point out that there must have been exactly 2 seeds for both the GRID and CHAIN-STAR
examples (respectively identified as the Red circles in Figure 9.1, and the Blue nodes in
Figure 9.2(b)), agreeing with the ground-truth and intuition.

3 We use the standard definition of Jaccard distance between two sets $A$ and $B$:
$JD(A, B) = 1 - \frac{|A \cap B|}{|A \cup B|}$.

Figures 9.3(b-c),(e-f),(h-i) show the results of our experiments for different numbers
$k^* = 1, 2, 3$ of true seeds on the AS-OREGON graph. We randomly selected 90 seed-sets of
each size. We made sure that the seed-sets contained both well-connected and weakly
connected nodes. Each point is an average of 10 runs.

Firstly, although not shown in the figures, NetSleuth was able to perfectly recover
the true number of seeds in almost all cases. For each seed-set, we calculate $JD_S(V_I)$
for the seeds returned by NetSleuth and for the true seeds, and give the scatter plot in
Figures 9.3(b),(e),(h) for true-seed count $k^* = 1, 2, 3$ respectively (rows). Hence points
on or below the 45-degree line (solid blue) are better. Clearly, almost all points are
concentrated near the diagonal, showing high quality. In fact, many points are exactly
on the line, meaning we are able to recover the true seeds perfectly in many cases. We do
not show similar plots with our MDL score due to lack of space.

Next, we calculate $Q_{MDL}$ and $Q_{JD}$ averaged over all the different seed-sets (of the
same size, for $k^* = 1, 2, 3$). Results are shown in the bar plots (third column) of
Figures 9.3(c),(f),(i). The true-seed scores are represented by the dotted line at 1, for
both $Q_{MDL}$ and $Q_{JD}$. Clearly, all of the bars are close to 1, demonstrating that
NETSLEUTH consistently finds very good culprits. Moreover, both the $Q_{MDL}$ and $Q_{JD}$
quality metrics are similar in magnitude for all $k^*$'s, increasing our confidence in our
results.

9.5.4  Scalability

Figure 9.4: NetSleuth Scalability: Wall-clock running time (in seconds) for increasingly
larger infected snapshots of AS-OREGON (as the complexity just depends on the size of the
snapshot), $k^* = 1$. Each point is the average of 10 runs. Note that, as expected, the
running time scales linearly.

Figure 9.4 demonstrates the scalability of NETSLEUTH after running it on increasingly
larger infected snapshots of AS-OREGON (as the complexity just depends on the size
of the snapshot). As expected from our Lemma 9.3, the running time is linear in the
number of edges of the infected graph.

9.6  Related Work


As  mentioned  in  the  introduction,  although  diffusion  processes  have  been  widely  studied, 
the  problem  of  'reverse  engineering'  the  epidemic  has  not  received  much  attention, 
except  papers  by  Shah  and  Zaman  [SZ10,  SZ11]  and  Lappas  et  al.  [LTGM10].  Shah  and 
Zaman  [SZ10,  SZ11]  formalized  the  notion  of  rumor-centrality  for  identifying  the  single 
source  node  of  an  epidemic  under  the  SI  model,  and  showed  an  optimal  algorithm  for 
d-regular  trees.  Lappas  et  al.  [LTGM10]  study  the  problem  of  identifying  k  seed  nodes, 
or  effectors  of  a  partially  activated  network,  which  is  assumed  to  be  in  steady-state  under 
the  IC  (Independent-Cascade)  model.  In  contrast,  we  allow  for  (a)  multiple  seed  nodes, 
(b)  a  snapshot  from  any  time  during  the  infection,  and  (c)  find  the  number  of  seeds 
automatically, even for general graphs. Finally, we are also more efficient, running in time
linear in the number of edges of the infected graph.

The rest of the related work deals with epidemic/cascade-style processes and problems
related to them, like epidemic thresholds, immunization and influence maximization (most
of which we have already seen previously).

Influence Maximization: An important problem under the viral marketing setting is
the influence maximization problem [RD02, KKT03, GLL11, CWW10, HBW11]. Another
remotely related work is outbreak detection [LKG+07], in the sense that we aim to select
a subset of 'important' nodes on graphs.

MDL: We are not the first to use the Minimum Description Length principle [Grü07]
for a data mining purpose. Faloutsos and Megalooikonomou [FM07] argue that many data
mining problems are related to Kolmogorov Complexity, which means they can be
practically solved through compression; examples include clustering [CV05], pattern
mining [VvS11], and community detection [CPMF04]. We are, to the best of our knowledge,
however, the first to employ MDL with the goal of identifying culprits.

9.7  Conclusions 

In  this  chapter  we  discussed  finding  culprits,  the  challenging  problem  of  identifying  the 
nodes  from  which  an  infection  in  a  graph  started  to  spread.  We  proposed  to  employ 
the  Minimum  Description  Length  principle  for  identifying  that  set  of  seed  nodes  from 
which  the  given  snapshot  can  be  described  most  succinctly.  We  introduced  NETSLEUTH 
(based  on  a  novel  'submatrix-laplacian'  method),  a  highly  efficient  algorithm  for  both 
identifying  the  set  of  seed  nodes  that  best  describes  the  given  situation,  and  automatically 
selecting  the  best  number  of  seed  nodes — in  contrast  to  the  state  of  the  art. 

Experiments showed NETSLEUTH attains high accuracy both in detecting the seed
nodes and in correctly identifying their number. Importantly, NETSLEUTH scales linearly
with the number of edges of the infected graph, $O(\mathcal{E}_F + \mathcal{E}_I + V_I)$,
making it applicable on large graphs.

Part III
Models

Overview


In this part, we turn our attention to building expressive models from real data for
various propagation scenarios, using insights gained so far. Beyond their explanatory
power, we will see how these models can also perform challenging tasks like forecasting
and anomaly detection. We address two main settings here:

• Models for Information Diffusion: How quickly does a piece of news spread
over online media? Do the rise and fall activity patterns in online media follow any
underlying laws? After analyzing numerous real datasets, we present SpikeM,
a general, accurate and succinct model that explains such rise-and-fall patterns,
unifies previous models and patterns, and is interpretable, enabling effective forecasting
of trends and answering of 'what-if' scenarios.

• Models for Competing Tasks: If 'Alice' has nA=50 contacts and made nB=100 phone
calls to them, what should we expect for 'Bob', who has twice the contacts? One
would expect a linear relationship (double the contacts, double the phone calls).
However, we show that in numerous settings (like tweets vs re-tweets), the
relationship is a power law, being sub- or super-linear. We also present a simple model,
CTM, based on competing species, which succinctly explains the prevalence of such
power laws, and which we then use to spot outliers such as telemarketers.

Chapter 10

Rise  and  Fall  in  Information  Diffusion 


The recent explosion in the adoption of search engines and new media such as blogs
and Twitter has facilitated faster propagation of news and rumors. How quickly does a
piece  of  news  spread  over  these  media?  How  does  its  popularity  diminish  over  time? 
Does  the  rising  and  falling  pattern  follow  a  simple  universal  law? 

In  this  chapter,  we  propose  SpikeM,  a  concise  yet  flexible  analytical  model  for  the  rise 
and  fall  patterns  of  influence  propagation.  Our  model  has  the  following  advantages:  (a) 
unification  power:  it  generalizes  and  explains  earlier  theoretical  models  and  empirical 
observations;  (b)  practicality:  it  matches  the  observed  behavior  of  diverse  sets  of  real 
data;  (c)  parsimony:  it  requires  only  a  handful  of  parameters;  and  (d)  usefulness:  it 
enables  further  analytics  tasks  such  as  forecasting,  spotting  anomalies,  and  interpretation 
by  reverse-engineering  the  system  parameters  of  interest  (e.g.  quality  of  news,  count  of 
interested  bloggers,  etc.). 

Using  SpikeM,  we  analyzed  7.2GB  of  real  data,  most  of  which  were  collected  from 
the  public  domain.  We  have  shown  that  our  SpikeM  model  accurately  and  succinctly 
describes  all  the  patterns  of  the  rise-and-fall  spikes  in  these  real  datasets. 


10.1  Introduction 

How  do  spikes  behave  in  social  media?  Online  social  media  is  spreading  news  and 
rumors  in  new  ways  and  search  engines  have  facilitated  such  spreading  magnificently, 
creating  bursts  and  spikes.  Some  rumors  (or  memes,  hashtags)  start  slowly  and  linger; 
others  spike  early  and  then  decay;  others  show  more  complicated  behavior,  as  we  show 
in  Figure  10.1. 

Do real rise-and-fall patterns have any qualitative differences? Do they form
different classes? If yes, how many? Earlier work on YouTube data claims there are four
classes [CS08]. Empirical work found six classes [YL11]. How many classes are there
after all?

Our answer is: one. We provide a unifying model, SpikeM, that requires only a
handful of parameters, and we show that it can generate all patterns found in real data
simply by changing the parameter values.

Figure 10.1: Modeling power of SpikeM: six types of spikes (K-SC from [YL11]) shown as dots,
and our model fit as a solid red line. Data sequences span over 120 time-ticks, while SpikeM
requires only seven parameters. The fit is so good that the red line is often invisible, due to
occlusion.

Figure 10.1 shows six representative spikes of online media (memes) from K-SC [YL11],
as gray circles, as well as our fitted model, as a solid red line. Notice that the fit is
very good, despite the fact that our SpikeM model requires only seven parameters, and
that the time-sequences span 120 intervals.

Informally,  the  problem  we  want  to  solve  is  to  model/predict  an  activity  (e.g.,  number 
of  blog  postings),  as  a  function  of  time,  given  some  breaking-news  at  a  given  timetick. 
We  will  use  a  blogger  example  for  brevity  and  clarity,  but  many  other  processes  could  be 
also  modeled  (people  buying  products,  computer  viruses  infecting  machines,  rumors 
spreading  over  Twitter,  etc).  Thus,  we  have: 

Informal Problem 1 (what-if). Given a network of bloggers (/hosts/buyers), a shock (e.g.,
event) at time nb, the interest/quality of the event, and the count Sb of bloggers that immediately
(at time nb) blog about the event, find how the blogging activity will evolve over time.

A closely related problem is to develop a parsimonious model that can be made to fit
several spikes observed in the past (as we do in Figure 10.1). That is,

Informal Problem 2 (model design). Given the behavior of several spikes in the past, find
an equation/model that can explain them, with as few parameters as possible.

It would be good if the parameters had an intuitive explanation (like 'number of
bloggers', 'quality of news', etc., as opposed to, say, the coefficients a1, a2 of an
autoregressive model (AR/ARIMA)).

In this work, we propose the SpikeM model to solve both of the aforementioned problems.
SpikeM has the following advantages:

• Unification power: it includes earlier patterns and models as special cases ([YL11,
TBK09]).

• Practicality: it matches the behavior of numerous, diverse, real datasets, including
power-law decay.

• Parsimony: our model requires only a handful of parameters.

• Usefulness: thanks to the SpikeM model, we can answer 'what-if' questions (see
subsection 10.5.1), spot outliers, and reverse-engineer the system parameters (quality of
news, count of interested bloggers, time-of-day behavior of bloggers).

Our  SpikeM  model  is  enabled  by  a  careful  design  to  incorporate  (a)  the  power-law 
decay  in  infectivity,  (b)  a  finite  population,  and  (c)  proper  periodicities.  Earlier  models 
ignored  one  or  more  of  the  above  issues. 

Thanks to the practicality of SpikeM, we can perform forecasting, analysis of 'what-if'
scenarios, and detection of anomalies, as we show in section 10.4 and section 10.5. We
should highlight that traditional AR, ARIMA and related linear models are
fundamentally unsuitable, because they are linear (and can diverge to infinity) and because they
lead to exponential decays (as opposed to the power law that reality seems to obey).
Table 10.1 illustrates the relative advantages of our method: the C-S method (Crane and
Sornette) [CS08] assumes an infinite population of bloggers; the clusters in K-SC [YL11]
(repeated in Figure 10.1) are non-parametric and are incapable of forecasting. The SI
model (closely related to the Bass model [Bas69] of the market penetration of new
products) leads to exponential decay, as opposed to the power-law decay that we observe in
real data.

Table 10.1: Capabilities of approaches. Columns: C-S, K-SC, SI, AR, SpikeM. Rows: system
identification, non-linear, power-law decay, periodicity, forecasting. Only our approach
meets all specs.
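The contrast between exponential and power-law decay can be made concrete with a small sketch. It compares the impulse response of a first-order autoregressive recursion x(n+1) = a * x(n) with a power-law kernel n^(-1.5). The coefficient a = 0.5, the horizon of 30 ticks and the function names are assumptions chosen only for illustration; they are not taken from our experiments.

```python
def ar1_impulse_response(a, steps):
    """Response of x(n+1) = a * x(n) to a unit impulse at n = 0."""
    values = [1.0]
    for _ in range(steps - 1):
        values.append(a * values[-1])
    return values


def power_law_kernel(exponent, steps):
    """Values of n ** exponent for n = 1, ..., steps."""
    return [n ** exponent for n in range(1, steps + 1)]


ar = ar1_impulse_response(a=0.5, steps=30)          # illustrative coefficient
pl = power_law_kernel(exponent=-1.5, steps=30)

# Successive ratios are constant for AR(1): a geometric (exponential) decay.
ar_ratios = [ar[i + 1] / ar[i] for i in range(len(ar) - 1)]
# For the power law the ratios creep towards 1, which gives a heavy tail.
pl_ratios = [pl[i + 1] / pl[i] for i in range(len(pl) - 1)]
```

After 30 ticks the power-law kernel is still several orders of magnitude above the AR(1) response, which is the qualitative difference that matters when fitting the tail of a spike.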

Outline  The  rest  of  the  chapter  goes  as  follows:  Section  10.2  presents  an  overview  of 
the  related  work  and  Section  10.3  the  proposed  model.  Sections  10.4  and  10.5  show  our 
experimental  results  on  a  variety  of  datasets.  We  conclude  in  section  10.7. 


10.2  Background 

In  this  section,  we  present  the  fundamental  concepts. 

Epidemiology fundamentals. The most basic epidemic model is the so-called
'Susceptible-Infected' (SI) model. Each object/node is in one of two states: Susceptible (S)
or Infected (I). Each infected node attempts to infect each of its neighbors independently
with probability β, which reflects the strength of the virus. Once infected, each node
stays infected forever. If we assume that the underlying network is a clique of N nodes,
and use our notation ('B' for blogged = infected), the most basic form of the model is:

$$\frac{dB(t)}{dt} = \beta \cdot (N - B(t)) \cdot B(t) \qquad (10.1)$$

where the time t is considered continuous, dB/dt is the derivative, and the initial
condition reflects the external shock (say, B(0) = b externally infected people). The
justification is as follows: β is the strength of the virus, that is, the probability that
an encounter between an infected person ('B') and an uninfected one will end up in
an infection, and we have B * (N - B) such encounters. The solution for B() is the
sigmoid, and its derivative is symmetric around the peak, with an exponential rise and
an exponential fall (see Figure 10.2, discussed later). There we also show the weakness of
the SI model: real data have a power-law 'fall' pattern.
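A minimal numerical sketch of Equation (10.1) is given below. It integrates the SI model with a forward-Euler step, which is an assumption made for simplicity (any ODE solver would do); the parameter values N = 1000, β = 0.001, b = 1 and the step size are likewise illustrative only.

```python
def simulate_si(N, beta, b0, dt, steps):
    """Forward-Euler integration of dB/dt = beta * (N - B) * B.

    Returns the trajectory B (length steps + 1) and the per-step
    increments, which approximate dt * dB/dt.
    """
    B = [float(b0)]
    new_infections = []
    for _ in range(steps):
        delta = dt * beta * (N - B[-1]) * B[-1]
        new_infections.append(delta)
        B.append(B[-1] + delta)
    return B, new_infections


# Illustrative values (assumptions): beta * N = 1, one initially infected node.
B, dB = simulate_si(N=1000, beta=0.001, b0=1, dt=0.1, steps=200)

# Index of the step with the largest increment, i.e. the peak of the spike.
peak = max(range(len(dB)), key=dB.__getitem__)
```

The trajectory B is the sigmoid, and the increments peak when roughly half the population is infected, which is why the rise and the fall of the SI spike mirror each other.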

Self-excited Hawkes process. Crane and Sornette [CS08] used a self-excited Hawkes
conditional Poisson process [H074] to model YouTube views per day, showing that spikes
in the activity have a power-law rise pattern and a power-law fall pattern, depending
on the model parameters. Roughly, the Hawkes process is a Poisson process where
the instantaneous rate is not constant, but depends on the count of previous events,
whose effect drops with the age τ of the event. That is, if there were a lot of events
(viewings/bloggings) recently, we will have many such events today.

The base model states that the rate of spread of infection depends on (a) the external
source S(t) and (b) self-excitation, that is, on earlier-infected nodes (i = 1, ...); these
nodes spread the infection with a virus strength φ(τ) that decays as their age τ grows, times
some constant μ_i. The constant μ_i is equivalent to the degree of the infected node i.

$$\frac{dB(t)}{dt} = S(t) + \sum_{i,\, t_i \le t} \mu_i \, \phi(t - t_i) \qquad (10.2)$$

The model typically assumes that the μ_i values are equal, namely that all nodes have the
same degree ('homogeneous' graph). It also silently assumes that there are infinitely many nodes
available for infection, and it may actually diverge to infinity.

Next we present our SpikeM model, which avoids the shortcomings of the SI and
Hawkes models, and has several more desirable properties.


10.3  Proposed  Method 

In this section we present our proposed method, analyze it, and provide the reader
with several observations that we find interesting.

Our model tries to capture the following behaviors, which we observed in several of
our real datasets:

• P1: power-law fall pattern

• P2: periodicities

and at the same time we want to

• P3: avoid the divergence to infinity

that other models may have. To handle P3 (divergence), we force our model to have a
finite population, and adjust the equations accordingly. To handle P1 (power-law fall
pattern), we assume that the infectivity of a node (= popularity of a blog post) decays
as a power law with the influence exponent, which we set at -1.5. The handling of
periodicities (P2) is discussed in subsection 10.3.2.

We  describe  our  model  in  steps,  adding  complexity,  and  we  start  with  the  base  model. 

Preliminaries  We  assume  there  are  N  bloggers,  and  none  of  them  is  yet  blogging 
about  the  topic  of  interest.  At  time  nb,  an  event  happens  (such  as  the  2004  Indonesian 
tsunami,  or  a  controversial  political  speech  such  as  'lipstick  on  a  pig'),  and  Sb  bloggers 
immediately  blog  about  it.  We  refer  to  this  external  event  as  a  shock,  and  nb  and  Sb  are 
the birth-time and the initial magnitude of the shock.

Our model needs a few more parameters: the first is the quality/interestingness of
the news, which we refer to as β, since this is the standard symbol for the infectivity of a
virus in epidemiology literature. If β is zero, nobody cares about this specific piece of
news; the higher the value, the more bloggers will blog about it.

Finally, we have the decay function f(n), which models how infective/influential
a blog posting is at age n. Standard epidemiology models assume that f() is constant
(once sick, you have the same probability of infecting others); recent analysis has shown
that the influence drops with age, following a power law.

The above are the parameters of the base model. Before we list the equations, we
want to briefly mention a derived quantity, β * N; this quantity roughly corresponds
to the R0 ('R-naught') found in the epidemiology literature. This tells us the size of the
"first burst": if only one person was infected, how many would be infected in the next
time-tick?¹

In  summary,  the  scenario  we  model  is  as  follows: 

• nothing happens, until a news-event appears, at birth-time nb.

•  Sb  bloggers  immediately  blog  about  it. 

•  other  bloggers  visit  the  initial  Sb  (or  follow-up)  bloggers,  and  occasionally  get 
'infected'  and  blog  about  the  event,  too. 

We  also  assume  that 

•  each  blogger  blogs  at  most  once  about  the  event 

•  no  other  related  event  occurs  -  that  is,  the  shock  function  S()  has  only  one  spike. 

Without loss of generality, we also assume that once an un-informed blogger sees an
infected/informed blog, he/she always blogs about the event (if he/she blogs with
probability p < 1, we could absorb p in the infectivity factor β).

Our goal is to find an equation to describe the number ΔB(n) of people blogging
at time-tick n, as a function of n and of course the system parameters (total number of
bloggers N, strength of infection β, etc.).

10.3.1  Base  model  -  SpikeM-Base 

The  model  we  propose  has  nodes  (=bloggers)  of  two  states: 

•  U:  Un-informed  of  the  rumor 

•  B:  informed,  and  Blogged  about  it 

Let U(n) be the number of un-informed people at time n, and let ΔB(n) be the number
of people who just found out about the rumor at time-tick n. We assume that, once
informed, a person blogs about the rumor immediately.

¹ Yes, it should be N - 1, but we sacrifice accuracy for intuition.




Model 1 (SpikeM-Base). Our base model is governed by the equations

$$\Delta B(n+1) = U(n) \cdot \sum_{t=n_b}^{n} \big( \Delta B(t) + S(t) \big) \cdot f(n+1-t) + \epsilon \qquad (10.3)$$

$$U(n+1) = U(n) - \Delta B(n+1) \qquad (10.4)$$

where

$$f(n) = \beta \cdot n^{-1.5} \qquad (10.5)$$

and initial conditions:

$$\Delta B(0) = 0, \quad U(0) = N$$

In addition, we add an external shock S(n), a spike generated at birth-time nb. Mathematically,
it is defined as follows:

$$S(n) = \begin{cases} 0 & (n \neq n_b) \\ S_b & (n = n_b) \end{cases} \qquad (10.6)$$


Justification  of  the  model  We  do  it  in  steps: 

• The term ΔB(t) + S(t) captures the count of bloggers plus external sources that got
activated at time-tick t; their infectivity is modulated by the f() infectivity function,
since we assume that the infectivity of a source/blogger decays with time. The
summation is over all past time-ticks since the birth-time nb of the shock.

• The infectivity function f() follows a power law with exponent -1.5, as
discovered by earlier work on real data: real bloggers [LMF+07], and response to
mails by Einstein and Darwin [Bar05].

• The summation gives the available stimuli at time-tick n; the available
targets are the un-informed bloggers U(n), and the product gives the number of
new infections.

• We add a noise term ε to handle cases such as hashtag 'egypt' on Twitter: some
people tweet about Egypt anyway, but a large shock occurred during the events in
Tahrir square. Very often, ε ≈ 0.

This  completes  the  justification  of  our  base  model. 
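The recursion of Model 1 can be sketched in a few lines of Python. This is a minimal illustration and not the implementation used for our experiments: the function name, the parameter values in the usage lines, and the clipping of the new bloggers to the range [0, U(n)] are assumptions added for the example.

```python
def simulate_spikem_base(N, beta, n_b, S_b, eps, n_d):
    """Iterate the SpikeM-Base recursion for time-ticks 0, ..., n_d.

    Returns two lists of length n_d + 1: delta_B (new bloggers per tick)
    and U (un-informed bloggers per tick).
    """
    def f(age):
        # Infectivity of a post at the given age (age >= 1): beta * age^(-1.5).
        return beta * age ** -1.5

    def shock(t):
        # External shock S(t): a single spike of size S_b at birth-time n_b.
        return S_b if t == n_b else 0.0

    delta_B = [0.0]       # initial condition: Delta B(0) = 0
    U = [float(N)]        # initial condition: U(0) = N
    for n in range(n_d):
        # Available stimuli: all activations since n_b, weighted by their age.
        stimuli = sum((delta_B[t] + shock(t)) * f(n + 1 - t)
                      for t in range(n_b, n + 1))
        new = U[n] * stimuli + eps
        # Assumption (not part of the equations): keep U(n) non-negative.
        new = min(max(new, 0.0), U[n])
        delta_B.append(new)
        U.append(U[n] - new)
    return delta_B, U


# Illustrative parameters (assumptions): beta * N = 1, shock of 5 at tick 10.
delta_B, U = simulate_spikem_base(N=2000, beta=0.0005, n_b=10, S_b=5.0,
                                  eps=0.0, n_d=120)
```

With these values nothing happens before the shock, and the first tick after it yields Sb * β * N = 5 new bloggers, in line with the "first burst" interpretation of β * N.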

We  also  mention  some  facts  that  our  model  obeys:  by  definition 


$$B(n) = \sum_{t=0}^{n} \Delta B(t)$$


and  of  course  we  have  the  invariant 

B(n)  +  U(n)  =  N 

where N is the total number of people/bloggers.

Symbol   Definition
N        total population of available bloggers
nd       duration of sequence
n        time-tick (n = 0, ..., nd)
U(n)     count of un-informed bloggers
B(n)     count of informed bloggers
ΔB(n)    delta: count of informed bloggers at exactly time n
f(n)     infectiveness of a blog-post, at age n
β        strength of infection
β * N    "first-burst" size of infection
S(n)     volume of external shock at time n
nb       starting time of breaking news
Sb       strength of external shock at birth (time nb)
ε        background noise
Pa       strength of periodicity
Pp       period
Ps       phase shift of periodicity

Table 10.2: Symbols and definitions


10.3.2  With  periodicity  -  SpikeM 

Bloggers  may  modulate  their  activity  following  a  daily  cycle  (or  weekly,  or  yearly).  For 
example,  among  the  U(n)  uninformed  bloggers  at  time  n,  a  fraction  of  them  are  not 
paying  attention  (say,  because  they  are  tired  or  asleep).  How  can  we  reflect  this  in  our 
equations?  We  propose  an  answer  below,  and  then  we  provide  the  justification. 

Model 2 (SpikeM). We can capture the periodic behavior of bloggers with the following
equations:

$$\Delta B(n+1) = p(n+1) \cdot \Big( U(n) \cdot \sum_{t=n_b}^{n} \big( \Delta B(t) + S(t) \big) \cdot f(n+1-t) + \epsilon \Big) \qquad (10.7)$$

$$p(n) = 1 - \frac{1}{2} P_a \left( \sin\left( \frac{2\pi}{P_p} (n + P_s) \right) + 1 \right) \qquad (10.8)$$

where U(n), S(t) and f(n) are defined in Model 1.

Justification The model is identical to SpikeM-Base, with the addition of the
periodicity factor p(·). This captures the fact that bloggers tone down their activity, say, during
the night, or even stop it altogether. The idea is that U(·) is the count of victims available
for infection, and the summation is the number of attacks. Under normal circumstances,
each victim-attack pair would lead to a new victim; however, since the victims are not
paying full attention (tired/asleep), the attacks are not so successful, and thus we prorate
them by the periodic function p(·).

•  Pp  stands  for  the  period  of  the  cycle  (say,  24  hours). 

•  Ps  stands  for  the  phase  shift:  if  the  peak  activity  is  at  noon,  and  the  period  is  Pp=24 
hours,  then  Ps=18. 

• Pa depends on the amplitude of the fluctuation, and specifically it gives the relative
value of the off-time (say, midnight), versus peak time (say, noon). Thus, if Pa=0,
we have no fluctuation.
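The periodicity factor can be sketched as a small function. The code below assumes the form p(n) = 1 - (1/2) * Pa * (sin((2π/Pp) * (n + Ps)) + 1); the function name and the sample parameter values are illustrative assumptions.

```python
import math


def periodicity(n, P_a, P_p, P_s):
    """Periodicity factor p(n); its value always lies in [1 - P_a, 1]."""
    return 1.0 - 0.5 * P_a * (math.sin(2.0 * math.pi / P_p * (n + P_s)) + 1.0)


# Illustrative values (assumptions): a 24-tick cycle with strength 0.4.
hourly_factors = [periodicity(n, P_a=0.4, P_p=24, P_s=18) for n in range(48)]

# With P_a = 0 the factor is identically 1, so SpikeM reduces to SpikeM-Base.
flat = periodicity(5, P_a=0.0, P_p=24, P_s=18)
```

In the full model the raw number of new bloggers for time-tick n + 1 is simply multiplied by p(n + 1), so the factor can only damp activity, never amplify it.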

10.3.3  Additional  details 

Model  extensions  We  could  easily  extend  our  model  so  that  it  has  several  shocks  as 
opposed  to  just  one  as  considered  here.  We  could  also  extend  it  to  have  multiple  cycles 
(daily,  weekly,  yearly).  We  do  not  elaborate  on  these  extensions  for  two  reasons:  (a)  for 
clarity  and  (b)  because  the  current  model  fits  real  data  very  well,  anyway. 

Learning the parameters Our model consists of a set of seven parameters:

$$\theta = \{ N, \beta, n_b, S_b, \epsilon, P_a, P_s \}$$

Given a real time sequence X(n) of bloggers at time-tick n (n = 1, ..., nd), we use
Levenberg-Marquardt (LM) [Lev44] to minimize the sum of the squared errors:

$$D(X, \theta) = \sum_{n=1}^{n_d} \big( X(n) - \Delta B(n) \big)^2$$
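The objective being minimized can be sketched as follows. The code computes the sum of squared errors between an observed and a predicted sequence and, purely for illustration, picks the best of a few candidate parameter values by exhaustive scan. The scan is a stand-in and is not Levenberg-Marquardt; `model`, `toy_model` and the candidate values are hypothetical names and numbers introduced for the example.

```python
def sum_squared_error(observed, predicted):
    """Sum over all time-ticks of (X(n) - Delta B(n)) squared."""
    if len(observed) != len(predicted):
        raise ValueError("sequences must have the same length")
    return sum((x - y) ** 2 for x, y in zip(observed, predicted))


def coarse_scan(observed, model, candidates):
    """Return the candidate parameter value with the smallest error.

    `model(theta)` is a hypothetical callable that returns the predicted
    sequence for parameter value theta. An exhaustive scan replaces the
    Levenberg-Marquardt optimizer here to keep the sketch short.
    """
    return min(candidates,
               key=lambda theta: sum_squared_error(observed, model(theta)))


def toy_model(rate):
    # Toy stand-in for a spike model: a geometric sequence of 10 ticks.
    return [100.0 * rate ** n for n in range(10)]


observed = toy_model(0.7)
best = coarse_scan(observed, toy_model, [0.5, 0.6, 0.7, 0.8])
```

In practice a gradient-based optimizer explores the seven-dimensional parameter space far more efficiently than such a scan; the sketch only shows what quantity is being driven towards zero.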

Analysis - exponential rise, power-law fall It is not obvious from the equations of
our model, but its rise pattern is exponential, while the fall pattern obeys a power
law. This is desirable, because this behavior seems to prevail in real data, as we
show in Figure 10.2. Let nmode denote the time-tick at which the wave ΔB() reaches its
maximum volume. By rise-plot we mean the plot of the values from the birth-time nb until
nmode, with time reversed (that is, plotted against abs(n - nmode)). The fall-plot is defined
similarly: activity ΔB() versus delay from the peak, n - nmode. Notice that there is a power
law for the fall, and an exponential shape for the rise. We also show the traditional 'SI'
model, which, as expected, exhibits exponential behavior for both rise and fall.
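The construction of the rise-plot and fall-plot can be sketched as a helper that splits a sequence around its mode. The function name and the toy sequence are assumptions made for the example.

```python
def split_rise_fall(delta_B, n_b):
    """Split a spike into rise and fall series around its mode.

    Returns (n_mode, rise, fall). Both series are lists of
    (distance from the mode, activity) pairs; the rise runs backwards
    in time from n_mode to n_b, the fall runs forwards from n_mode.
    """
    n_mode = max(range(len(delta_B)), key=delta_B.__getitem__)
    rise = [(n_mode - n, delta_B[n]) for n in range(n_mode, n_b - 1, -1)]
    fall = [(n - n_mode, delta_B[n]) for n in range(n_mode, len(delta_B))]
    return n_mode, rise, fall


# Toy sequence (an assumption): birth at tick 2, peak at tick 4.
n_mode, rise, fall = split_rise_fall([0, 0, 1, 4, 9, 5, 3, 2], n_b=2)
```

An exponential series appears as a straight line when the activity axis alone is logarithmic (linear-log), whereas a power-law series appears as a straight line when both axes are logarithmic (log-log); plotting each series on both kinds of axes is therefore enough to tell the two shapes apart.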


10.4  Experiments 

To evaluate the effectiveness of SpikeM, we carried out experiments on real datasets.
The experiments were designed to answer the following questions:

• Q1: Can we explain the cluster centers of K-SC?

• Q2: How well do we match MemeTracker data?

• Q3: How does it compare with other data?

• Q4: How well do we forecast future patterns?

Figure 10.2: Fitting results of SpikeM vs. SI for pattern C1 in Figure 10.1. The original sequence
(in gray circles), and our model (red line) have an exponential rise and a power-law drop; the SI
model (blue dashed line) is exponential on both and thus unrealistic. Top row: full interval; left
column: only the rise part; right column: only the 'fall' part. Panels: (b) rise-plot, linear-log
scale (time n: 42, 41, ..., 1); (c) fall-plot, linear-log scale (time n: 42, 43, ..., 120);
(d) rise-plot, log-log scale; (e) fall-plot, log-log scale.


Dataset  description  We  performed  experiments  on  the  following  three  real  datasets. 


• MemeTracker: This dataset covers three months of blog activity from August 1 to
October 31, 2008². It contains short quoted textual phrases ("memes"), each of which
comes with its number of mentions over time. We chose the 1,000 phrases in blogs
with the highest volume in a 7-day window around their peak volume.

• Twitter: We used more than 7 million Twitter³ posts covering an 8-month period
from June 2011 to January 2012. We selected the 10,000 most frequently used
hashtags.

• GoogleTrends: This dataset consists of the volume of searches for various queries
(i.e., words) on Google⁴. Each query represents the search volumes that are related
to keywords over time.


10.4.1 Q1: Explaining K-SC clusters

The results on this dataset were already presented in section 10.1 (see Figure 10.1). Our
model correctly captures the six patterns of K-SC. Table 10.3 gives a further description
of the SpikeM fitting. Our model consists of seven parameters, each of which describes
the behavior of spikes. Note that the total populations N are almost the same for all
patterns (around 2,000 to 3,000). This is because these six patterns are scaled on the
y-axis so that they all have a peak volume of 100. We can see that β * N is between
0.7 and 1.0 for these six patterns. We also see that Pattern C3 has an extreme shock Sb = 114
at time nb = 40, which means that this spike is strongly affected by the external burst of
activity (see Figure 10.1 (c)). On the other hand, Patterns C4-C6 have several peaks about
24 hours apart with a strength Pa ≈ 0.4.

We also evaluated our fitting accuracy by using the root mean square error (RMSE)
between estimated values and real values:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n_d} \sum_{n=1}^{n_d} \big( X(n) - \Delta B(n) \big)^2}$$

Table 10.4 shows the fitting accuracy for the six patterns of K-SC. We compared SpikeM with
the SI model. As discussed in section 10.3 (see Figure 10.2), SI cannot model the tail parts of
the spikes. On the other hand, our solution, SpikeM, achieves high accuracy for every
pattern of K-SC.
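The accuracy measure can be sketched as a short helper. It assumes the usual definition of the root mean square error, with the squared differences averaged over the length of the sequence; the function name and the sample numbers are illustrative.

```python
import math


def rmse(observed, predicted):
    """Root mean square error between two equally long sequences."""
    n = len(observed)
    if n == 0 or n != len(predicted):
        raise ValueError("sequences must be non-empty and of equal length")
    squared = sum((x - y) ** 2 for x, y in zip(observed, predicted))
    return math.sqrt(squared / n)


# Illustrative numbers (assumptions): squared errors 9 and 16 over two ticks.
example = rmse([0.0, 0.0], [3.0, 4.0])
```

Because the K-SC patterns are scaled to a peak volume of 100, the RMSE values in Table 10.4 can be read on that same scale.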


² http://memetracker.org/

³ http://twitter.com/

⁴ http://www.google.com/insights/search/

         C1      C2      C3       C4      C5      C6
N        2407    1283    1466     3079    4183    3435
β * N    0.95    1.00    0.86     0.92    0.79    0.69
nb       26      17      40       35      0       34
Sb       4.73    0.06    114.13   23.24   2.58    45.58
ε        0.36    0.01    0.43     1.48    0.32    13.97
Pa       0.18    0.06    0.22     0.38    0.28    0.39
Ps       12      5       7        6       2       2

Table 10.3: The model parameters of our SpikeM best fitting on six patterns of K-SC (see
Figure 10.1).


Pattern   C1      C2      C3      C4      C5      C6
SpikeM    1.84    1.61    0.97    4.08    3.33    5.89
SI        15.64   6.78    19.65   25.29   20.36   21.76

Table 10.4: Fitting accuracy of SI vs. SpikeM on six patterns of K-SC. SpikeM consistently
outperforms SI with respect to accuracy (RMSE) between the original values and the models.


10.4.2  Q2:  Matching  MemeTracker  patterns 

Figure 10.3 shows the results of model fitting on the MemeTracker dataset. We selected
six typical sequences according to the K-SC clusters; that is, one sequence for
each pattern (C1-C6). We show the original sequences (black dots) and the SpikeM fit,
ΔB(n) (red line), in both linear-linear (top) and log-log (bottom) scales. In the log-log
scale, we also show the count of un-informed bloggers, U(n). In Figure 10.3, the bottom
table shows the short phrases (memes) of each sequence. All of the phrases are sourced
from U.S. politics in 2008. We made several observations for each sequence:

• Patterns C1 and C2: almost the same size of population, N ≈ 500, except that C2 has
a quicker rise and fall (i.e., stronger infection, β * N = 1.4) than C1 (β * N = 0.94).

• Pattern C3: this sequence has a sudden rise and a power-law decay. There is a
slight daily periodicity.

• Patterns C4 and C5: there are clearly daily periodicities. Pattern C5, "lipstick on a
pig", has the largest population of all six sequences (i.e., N = 6259).

• Pattern C6: the sequence "yes we can" consists of huge spikes around n = 40, and
constant periodic noise. This is because the bloggers mention this phrase as Barack
Obama's slogan as well as with more general meanings. We can also find that there
are several extreme points (i.e., missing values) around n = 120 (see blue circle in
log-log scale).




Panel                        N      β * N   Phrase
(a) Pattern C1: Meme #109    562    0.94    the most serious financial crisis since the great depression
(b) Pattern C2: Meme #34     405    1.42    i love this country too much to let them take over another election
(c) Pattern C3: Meme #13     3529   0.81    hope over fear, unity of purpose over conflict and discord
(d) Pattern C4: Meme #87     772    1.04    what is required of us now is a new era of responsibility
(e) Pattern C5: Meme #9      6259   0.73    you can put lipstick on a pig
(f) Pattern C6: Meme #3      3234   0.69    yes we can yes we can

Figure 10.3: Results of SpikeM fitting on six patterns from the MemeTracker dataset. The figures
are shown in both 'linear-linear' (top) and 'log-log' (bottom) scales. The bottom table lists the
phrase ("meme") of each pattern.

10.4.3 Q3: Matching other data


Figure 10.4: Results of SpikeM fitting on three hashtags from the Twitter dataset: (a) #assange,
(b) #stevejobs, (c) #arresteddevelopment. The top and bottom rows are in linear-linear scale
and log-log scale, respectively. Panel titles: N = 992, β * N = 1.41; N = 6475, β * N = 2.00;
N = 1266, β * N = 1.41.


We  also  demonstrate  the  effectiveness  of  our  model  for  other  types  of  spikes. 

Fitting on Twitter data. Figure 10.4 describes our fitting results on the hashtags of
Twitter data. In this figure, we can see that Twitter data behave similarly to MemeTracker
data. Due to space limitations, we show only three major hashtags. Note that the top
and bottom rows are in linear-linear and log-log scales, respectively. Our model captures
the following characteristics: (a) #assange: this is a topic about Julian Assange, the
founder of WikiLeaks. There are several mentions before the peak point (December
5, 2011). (b) #stevejobs: there is a sudden peak on October 5, 2011, with a long heavy
tail (see Figure 10.4(b) in log-log scale). This was caused by the death of Steve Jobs. (c)
#arresteddevelopment: this is a topic about the movie "Arrested Development". There is a
clear daily periodicity with a peak point.

Fitting on GoogleTrends data. We can also observe influence propagation in queries on internet
search engines. Figure 10.5 shows two different types of spikes on GoogleTrends. For an external
catastrophic event, (a) "tsunami", we see a very rapid rise immediately after the Indian Ocean
earthquake and tsunami of December 2004. In contrast, (b) "harry potter" has a slower rise,
because this spike was generated by "word-of-mouth" activity surrounding the release of a Harry
Potter movie in 2007. SPIKEM captures both types of spikes successfully.
[Figure 10.5 panel residue: horizontal axis "Time"; panel (b) "Harry Potter" (2007)]


Figure 10.5: SpikeM fitting on GoogleTrends dataset: the volume of searches for the keyword (in
black dots) and fitting results (in red lines). Note that each time tick corresponds to one week.


10.4.4  Q4:  Tail-part  forecasts 


So far we have seen how SPIKEM captures the pattern dynamics for various spikes. Here, we answer
a more practical question: given the first part of the spike, how can we forecast the future
behavior of the tail part? Figure 10.6 shows the results of our forecasts on MemeTracker data.
We selected the two phrases with the highest populations (#9 and #13 in Figure 10.3). We trained
our models using the values observed over a period of 54 hours (solid black lines in the
figure), and then forecast the following days (solid red lines, about five days). Note that the
vertical axis uses a logarithmic scale. We compared SPIKEM with the autoregressive model (AR).
For a fair comparison, we used seven regression coefficients, matching the number of parameters
in our model.

Our method achieves high forecasting accuracy, while AR fails to forecast future patterns. More
specifically, the reconstruction errors of SPIKEM are RMSE = 9.26 and 8.93 for #9 and #13, while
AR has errors of 13.98 and 14.19. Similar trends are observed for other phrases; however, we
omit those results due to space limitations. More importantly, our model can forecast the rise
part of spikes as well as the tail part (discussed in Section 10.5).
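To make the comparison above concrete, here is a minimal sketch of an AR baseline with seven coefficients and of the RMSE metric. The least-squares fitting routine, the helper names (`fit_ar`, `forecast_ar`, `rmse`) and the synthetic power-law series are assumptions introduced for illustration; they are not the experimental code or data behind Figure 10.6.

```python
import numpy as np

def fit_ar(series, order=7):
    """Least-squares fit of x[n] = sum_k a[k] * x[n-1-k] (no intercept term)."""
    x = np.asarray(series, dtype=float)
    # Each row holds the `order` previous values, most recent first.
    rows = [x[n - order:n][::-1] for n in range(order, len(x))]
    coeffs, *_ = np.linalg.lstsq(np.array(rows), x[order:], rcond=None)
    return coeffs

def forecast_ar(history, coeffs, steps):
    """Roll the AR recursion forward, feeding each prediction back as input."""
    buf = [float(v) for v in history]
    order = len(coeffs)
    for _ in range(steps):
        recent = np.array(buf[-order:][::-1])
        buf.append(float(coeffs @ recent))
    return np.array(buf[len(history):])

def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    a = np.asarray(actual, dtype=float)
    p = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((a - p) ** 2)))

# Synthetic spike with a power-law tail (illustrative data only).
n = np.arange(1, 120)
spike = 100.0 * n ** -1.5
train, test = spike[:54], spike[54:]      # train on the first 54 ticks
coeffs = fit_ar(train, order=7)
pred = forecast_ar(train, coeffs, steps=len(test))
print("AR(7) tail RMSE on the synthetic spike:", rmse(test, pred))
```

The same `rmse` helper can score any other forecast of the tail, so both models are compared on identical held-out points.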


10.5  Discussion  -  SpikeM  at  work 

Our proposed model, SpikeM, supports a variety of applications. Here, we describe the most
important ones and give examples of how our approach can be used.

10.5.1  "What-if"  forecasting 

We have discussed tail-part forecasting in subsection 10.4.4. Ideally, we want to forecast not
only the tail-part, but also the rise-part of a spike. This is much more difficult, because we
usually have very few points in the rise-part of a spike. However, if this is a repeating event,
like, say, the spikes induced by 'Harry Potter' movie releases, can we forecast future spikes if
we know the release date of the next movie? It turns out that our SpikeM model can help with
this (difficult) task, too.

Figure 10.6: Results of tail-part forecasting on MemeTracker data. We train on the spikes from
n = 0 to 54, and then start forecasting at time n = 54. SpikeM reflects reality better, while AR
quickly converges to zero.

Thus, the problem we address in Figure 10.7 is as follows: we are given (a) the first spike in
2009, "Harry Potter and the Half-Blood Prince" (n = 185); (b) the release dates of the two
sequel movies (blue text with arrows pointing at n = 255 and 289); and (c) the access volume
before the release dates (specifically, from 8 to 2 weeks before). Can we forecast the rise and
fall shapes of the upcoming spikes and their peak points?

Solution and results. SpikeM can estimate the potential population N of users who are
interested in "Harry Potter", and the strength of the 'word-of-mouth' infection, β. Our solution
is to assume that these values are fixed for all of the sequel spikes. The only difference is
the strength of the "external shock", i.e., n_b and S_b. Our solution consists of the following
three-step process:


1. Train the parameter set θ by using the first spike (solid black line in the figure).

2. With the fixed parameters θ, infer the new values of n_b and S_b by using the beginning part
of the next spike (blue lines between double arrows at n = 250 and 280).

3. Generate the spikes using θ, n_b and S_b (red lines).


In conclusion. Figure 10.7 shows that our model successfully captures the two sequel spikes and
their peak points n_mode.
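The three-step procedure above can be sketched as follows. This is only an illustration under stated assumptions: `simulate_spike` is a simplified SpikeM-style recursion written for this example (the periodicity term is omitted, the decay exponent of -1.5 and the cap on new infections are assumptions, and all function and argument names are hypothetical), and step 2 uses a plain grid search rather than whatever optimizer produced Figure 10.7.

```python
import itertools
import numpy as np

def simulate_spike(N, beta, n_b, S_b, eps, T):
    """Simplified SpikeM-style recursion (assumed form, periodicity omitted).

    N: potential population, beta: infection strength,
    n_b / S_b: time and size of the external shock, eps: background rate.
    Returns the number of newly informed users at each of T time ticks.
    """
    dB = np.zeros(T)
    uninformed = float(N)
    for n in range(T - 1):
        influence = 0.0
        if n >= n_b:
            t = np.arange(n_b, n + 1)
            shock = np.where(t == n_b, S_b, 0.0)            # shock only at n_b
            decay = beta * (n + 1 - t).astype(float) ** -1.5
            influence = float(np.sum((dB[t] + shock) * decay))
        # Cap at the remaining population (an assumption of this sketch).
        new = min(uninformed, uninformed * influence + eps)
        dB[n + 1] = new
        uninformed -= new
    return dB

def infer_shock(observed, N, beta, eps, nb_grid, sb_grid):
    """Step 2: keep (N, beta, eps) fixed and grid-search the shock (n_b, S_b)."""
    T = len(observed)
    best, best_err = None, np.inf
    for n_b, S_b in itertools.product(nb_grid, sb_grid):
        fit = simulate_spike(N, beta, n_b, S_b, eps, T)
        err = float(np.sum((fit - observed) ** 2))
        if err < best_err:
            best, best_err = (n_b, S_b), err
    return best

# Step 1 (assumed done): base parameters learned from the first spike.
N, beta, eps = 1000.0, 0.001, 0.0
# Synthetic "beginning of the next spike", generated with a known shock.
early = simulate_spike(N, beta, n_b=5, S_b=20.0, eps=eps, T=15)
# Step 2: recover the shock from the early window only.
n_b_hat, S_b_hat = infer_shock(early, N, beta, eps,
                               range(0, 10), [5.0, 10.0, 20.0, 40.0])
# Step 3: generate the full forecast with the fixed base parameters.
forecast = simulate_spike(N, beta, n_b_hat, S_b_hat, eps, T=60)
```

Because only two values are searched while the base parameters stay fixed, a short window before the release is enough to pin them down in this synthetic setting.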
Figure 10.7: Results of "what-if" forecasting for the Harry
Potter series. We trained parameters by using (a) the
first spike around July 15, 2009 (black solid line), and (b)
access volume two months before the release (blue lines
with double arrows around time n = 250, 280), and then
forecasted the following two spikes (red lines).


Figure  10.8:  Outlier  detection  on 
GoogleTrends  dataset  (in  log-log 
scale).  Notice  that  the  biggest  spike, 
"world  marks  tsunami  anniversary" 
occurred  after  one  year  (i.e.,  52 
weeks  later). 


10.5.2  Outlier  detection 

Since  SpikeM  has  a  very  high  fitting  accuracy  on  real  datasets  (described  in  section  10.4), 
another  natural  application  would  be  anomaly  detection.  Figure  10.8  shows  the  fitting 
result  of  Figure  10.5  (a),  in  a  log-log  scale.  Note  that  the  black  circles  are  the  original 
sequence,  and  the  pink  line  is  our  model  fitting.  We  can  visually  observe  that  there  are 
several  points  that  do  not  overlap  the  model.  For  example,  (a)  on  March  29,  there  is 
one  spike,  since  another  earthquake  occurred  on  March  28.  (b)  There  is  a  huge  spike  on 
December  26,  2005,  which  is  exactly  one  year  after  the  Indian  Ocean  earthquake. 
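The visual inspection described above could be automated along the following lines. This is a sketch only: the log-scale residual score, the median-absolute-deviation threshold and the function name `flag_outliers` are assumptions introduced for illustration, not a procedure defined in this chapter.

```python
import numpy as np

def flag_outliers(observed, fitted, k=5.0, floor=1.0):
    """Return indices where the data deviates strongly from the model fit.

    Residuals are taken in log scale (matching the log-log view of the plot);
    `floor` avoids log(0). A point is flagged when its residual lies more than
    k robust standard deviations (1.4826 * MAD) from the median residual.
    """
    obs = np.maximum(np.asarray(observed, dtype=float), floor)
    fit = np.maximum(np.asarray(fitted, dtype=float), floor)
    resid = np.log(obs) - np.log(fit)
    center = np.median(resid)
    mad = np.median(np.abs(resid - center))
    scale = 1.4826 * mad if mad > 0 else 1e-12
    return np.flatnonzero(np.abs(resid - center) > k * scale)
```

With weekly data, an index returned by `flag_outliers` maps directly to a week, so a flagged point 52 ticks after the main spike would correspond to an anniversary effect like the one noted above.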

10.5.3  Reverse  engineering 

Most importantly, our model parameters have intuitive interpretations, such as the potential
number of interested bloggers and the quality of the news. Here we report our discoveries on the
MemeTracker and Twitter datasets (see Figure 10.9).

Observation 10.1 (Total population of bloggers). The total populations of potential
bloggers/users N are almost the same for both datasets (around N = 1,000 to 2,000).

We  also  note  that  they  are  skewed  to  the  right,  i.e.,  there  is  a  long  tail  of  larger  values. 

Observation 10.2 (Strength of first infection). The strength of the "first burst" is β * N ≈ 1.0
for each dataset.

The  above  two  observations  agree  with  the  intuition:  we  can  see  common  behavior 
for  MemeTracker  and  Twitter,  which  means  that  they  have  similar  characteristics  in  terms 
of  social  activities. 
Figure 10.9: Reverse engineering: pdf of three parameters: N, β * N, P_s over 1,000
memes/hashtags. (a) MemeTracker: total potential bloggers N ≈ 1,000, and strength of "first
burst" β * N ≈ 1.0. More than 90% of the memes have clear daily periodicity with high activities
around 6pm (i.e., P_s ≈ 0). (b) Twitter: similar trends except more spread in P_s, possibly due
to multiple time zones. Also see the text for more observations.

Observation 10.3 (Common activity and periodicity). Typical user behavior is to have a daily
periodicity, with (a) phase shift P_s = 0 (small population during early morning, large
population at peak point, 6pm) for MemeTracker, and (b) more spread in P_s for Twitter.

Note that more than 90% of all spikes have a daily periodicity in both datasets. The only
difference between the two datasets is that Twitter has several P_s values. This is because
Twitter users span multiple time zones (e.g., US, UK, Australia, and India).


10.6  Related  Work 

We  present  the  related  work,  in  three  areas:  time  series  analysis,  influence  propagation, 
and  burst  detection. 

Time series analysis. This is an old topic that has attracted huge interest and is covered in
well-regarded textbooks [BJR94]. Traditional approaches applied to data mining include
Auto-Regression (AR) and variations [LLL+11], or Linear dynamical systems (LDS), Kalman filters
(KF) and variants [JCW04, LPF10, ?], but they are all linear methods. Non-linear methods for
forecasting tend to be hard to interpret, because they rely on nearest-neighbor search [CF02] or
artificial neural networks [WG94]. Similarity search, indexing and pattern discovery in time
sequences have also attracted huge interest
[FRM94, KS01, GKMS01, KPZ+04, LKL+04, VKY09, PAP+11, SFY07, SPF05, MSY09], but
none of these methods specifically focused on modeling bursts.

Influence propagation. The canonical textbook for epidemiological models like SI is Anderson
and May [AM91]. The power-law decay of influence has been reported in blogs [MLF+07], with an
exponent of -1.5. Barabasi and his colleagues reported exponents of -1 and -1.5 for the response
time in correspondence [Bar05]. Analyses of epidemics, blogs, social media, propagation and the
cascades they create have attracted much interest [LBK09, YL10, KMM10, PCF+11, PBRF12, KKT03,
TPT+10, GLMF09, GLNGT04, GKRT04, LAFI06], and recently so has the reverse problem ('find who
started it') [LTGM10, SZ11].

Burst  detection.  Remotely  related  to  our  work  are  the  efforts  to  spot  bursts.  This 
includes  the  work  of  Kleinberg  [Kle02],  the  algorithm  of  Zhu  and  Shasha  [ZS03],  and 
the  algorithm  of  Parikh  et  al.  [PS08].  None  of  the  above  gives  a  parsimonious  model  for 
describing  the  activity  in  a  network. 


10.7  Conclusions 

In this chapter, we studied the rise-and-fall patterns in information diffusion processes
through online media. We presented SpikeM, a general, accurate and succinct model that explains
the rise-and-fall patterns. Our proposed SpikeM has the following appealing advantages:

•  Unification  power:  it  includes  earlier  patterns  and  models  as  special  cases  (K-SC, 
as  well  as  the  SI  model); 

•  Practicality:  it  matches  the  behavior  of  numerous,  diverse,  real  datasets,  including 
the  power-law  decay  and  much  more  beyond; 

•  Parsimony:  our  model  requires  only  a  handful  of  parameters; 

•  Usefulness:  we  showed  how  to  use  our  model  to  do  'short-term'  forecasting,  to 
answer  what-if  scenarios,  to  spot  outliers,  and  to  learn  more  about  the  mechanisms 
of  the  spikes. 
Chapter 11

Patterns  amongst  Competing  Tasks 


If  Alice  has  double  the  friends  of  Bob,  will  she  also  have  double  the  phone-calls  (or 
wall-postings,  or  tweets)?  We  show  that  the  answer  to  the  question  is  a  power-law: 
sub-linear,  or  super-linear,  for  a  wide  variety  of  diverse  settings:  tasks  in  a  phone-call 
network,  like  count  of  friends,  count  of  phone-calls,  total  count  of  minutes;  tasks  in  a 
twitter-like  network,  like  count  of  tweets,  count  of  followees  etc. 

Why are there so many super-linear relationships? We answer this question through our proposed
"competing tasks" model (CTM), which leads exactly to power-law relationships between task
frequencies. In a nutshell, the harder the task, the exponentially less frequently it will
happen. Our model is inspired by population ecology and competing viruses (Chapter 4), and uses
a generalization of the famous Volterra-Lotka prey-predator model. We illustrate our
observations on two large, real datasets, spanning ~2.2M and ~3.1M individuals with 5 features
each. We show how to use our observations to spot clusters and outliers, e.g., telemarketers in
our phone-call network.


11.1  Introduction 

If  'Alice'  has  nA=50  contacts  and  did  nB=100  phone-calls  to  them,  what  should  we  expect 
for  'Bob',  who  has  twice  the  contacts?  One  would  expect  a  linear  relationship  (double 
the  contacts,  double  the  phone-calls).  However,  we  show  that  in  numerous  settings,  the 
relationship  is  a  power  law,  being  sub-  or  super-linear. 

The  questions  we  want  to  answer  here  are  the  following: 

1. Q1: Why: what is the reason that super-linear relationships are so prevalent?

2.  Q2:  Practical  use:  Can  we  answer  'what-if'  scenarios?  Can  we  find  anomalies? 

Our  contributions  are  exactly  the  answers  to  the  above  questions: 

• A1: We propose the CTM model, inspired by population ecology, with tasks competing for
resources (a person's time). Then, we show that the relative frequency of tasks is
exponentially related to the task's difficulty.
Figure 11.1: Illustration of super-linearity: Power-law relationship between count of tweets and
count of retweets for each user in the Tencent-Weibo dataset. Log-log scales, each red square is
the conditional average of tweets, for the given count of retweets. The last few points are noisy,
because of extreme-value effects.


•  A2:  We  show  that  our  SURF  fits  several,  diverse  real  datasets,  and  present  how  to 
spot  outliers,  and  how  to  answer  what-if  questions. 


Table 11.1: Symbols and definitions

Symbol    Definition
dA, dB    duration of task 'A' & 'B'
rA, rB    reproduction rate of species 'A' & 'B' - inverse of duration
nA, nB    population of species 'A' & 'B' = occurrence freq. of 'A' & 'B'
SuRF      Super-Linear Relative Frequency Observation
CTM       Our Competing Tasks Model

We report results from two large, real, diverse datasets. The first is a phone-call dataset
spanning ~3.1M users; for each customer, we study the count of distinct contacts, the count of
phone-calls, the total minutes, and the count of text messages. The second is from
Tencent-Weibo, a Chinese version of Twitter™, with count of tweets, re-tweets, followees, etc.
per user. Figure 11.1 illustrates our main idea: it depicts the power-law relationship between
count of tweets and count of retweets. Both axes are logarithmic; each red square is the
conditional average of tweets, for the given count of retweets. The last few points are noisy,
because of extreme-value effects (there are very few people with so many re-tweets, and they
dominate the average).
Table 11.1 gives the major symbols we use and their definitions. The rest of the
sections are organized in the typical way: Proposed explanation and CTM Model;
Experiments; Related work and Conclusions.


11.2  Competing  Tasks  Model  (CTM) 

Here we focus on question Q1 of the introduction: why do we see a power-law relationship
between, say, tweets and retweets (Figure 11.1)?

We do a lot of tasks during each day. Let nA and nB be the frequencies of tasks 'A' and 'B'
respectively, where, say, 'A' stands for 'going-for-a-walk' and 'B' for
'clicking-on-a-web-link'. Let dA and dB be the effort (say, time duration) that each task
requires. One might expect that if a task takes twice as long, then it should happen half as
often:

\[ n_A : n_B = d_B : d_A. \]

However, this does not seem to agree with the numerous observations of the disproportionate
impact of small changes in the effort to do any task. For example, the huge difference in organ
donorship between culturally similar Germany and Austria (12% vs 99.9%) has been attributed to
the difference in the 'default' choices: "in Austria, the default choice is to be an organ
donor, whereas in Germany the default is not to be" [Wat11]. Similarly, the British physicist
Stephen Hawking notes in his famous popular science book 'A Brief History of Time' that an
editor warned him that "for every equation in the book the readership would be halved"
[Haw88]. Also, if a web form has 'opt-in' by default, the vast majority of people will opt in,
despite the fact that it only takes a few seconds to find the box and click the mouse. We shall
refer to this observation as the Relative Effort observation:

Observation 11.1 (RELATIVE EFFORT). A small change in the effort (dA) required for a task
makes a big difference in nA, the number of times (frequency) we execute this task.

Next  we  give  a  model  that  simultaneously  (a)  agrees  with  the  intuitive  RELATIVE 
EFFORT  observation,  and  (b)  also  explains  the  super-linear  relationship  of  frequencies 
nA  and  nB.  We  shall  refer  to  the  latter  as  SuRF:  super-linear  relative  frequency  observation. 

Specifically,  we  conjecture  that  the  relative  frequencies  nA  and  nB  follow  a  power-law, 
where  the  exponent  depends  on  the  relative  effort  (or  time  duration  dA  and  dB)  of  the 
two  tasks. 

Conjecture 11.1 (Super-linear Relative Frequency (SuRF)). With nA, nB the task frequencies,
and dA, dB the task durations, we have

\[ d_A \log(n_A) = d_B \log(n_B) = c \tag{11.1} \]

or equivalently

\[ n_A = \exp(c / d_A) \tag{11.2} \]
where c is some constant. This implies that a small change in the effort dA of a task
(e.g., through a better web interface, requiring fewer clicks) will have an exponential
impact on the frequency nA of the task.
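As a worked illustration, the short derivation below starts from Equations (11.1) and (11.2) and shows both the power law between the two frequencies and the effect of reducing effort. The halving scenario is chosen purely as an example and is not taken from the experiments.

```latex
\begin{align*}
% From (11.1): d_A \log n_A = d_B \log n_B
\log n_A &= \frac{d_B}{d_A}\,\log n_B
  \quad\Longrightarrow\quad n_A = n_B^{\,d_B/d_A}, \\
% From (11.2): halving the effort, d_A \to d_A/2, squares the frequency
n_A' &= \exp\!\left(\frac{c}{d_A/2}\right)
      = \exp\!\left(\frac{2c}{d_A}\right) = n_A^{2}.
\end{align*}
```

In particular, when task 'A' is the easier one (dA < dB), the exponent dB/dA exceeds 1, so nA grows super-linearly in nB.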

Empirical  Evidence:  Please  see  Figure  11.2  for  numerous  pairs  of  attributes  of  our  real 
data,  where  the  "super-linear  relative  frequency"  observation  (SURF)  holds. 

11.2.1  Justification 

We  can  show  that  the  conjecture  holds,  if  we  borrow  the  concept  of  competing  species 
from  population  ecology:  suppose  we  have  two  species  of  herbivores,  'A'  and  'B',  feeding 
on  the  same  finite  resource  (say,  R  pounds  of  grass,  on  an  isolated  island).  The  species 
require  dA  and  dB  amounts  of  grass  per  individual  to  produce  one  offspring.  How  many 
members  nA  and  nB  of  each  species  would  we  expect  to  see?  This  situation  is  equivalent 
to  tasks  'A'  and  'B',  each  competing  for  user's  time,  and  each  requiring  dA  and  dB  units 
of  time  respectively. 

We use Lotka-Volterra-like equations [MM07] to model the competing species 'A' and 'B', with
reproductive rates rA = 1/dA and rB = 1/dB, where dA, dB are the amounts of resource (grass)
needed to create offspring. Then, we have

\[ \frac{dn_A}{dt} = r_A n_A \left( R - (d_A n_A + d_B n_B) \right) \tag{11.3} \]

\[ \frac{dn_B}{dt} = r_B n_B \left( R - (d_A n_A + d_B n_B) \right) \tag{11.4} \]

where dnA/dt is the time-derivative of the population of species A, dnB/dt is the
time-derivative of the population of species B, and R is the maximum total units of resources
for which the species compete. The equations above state that the rate of change of a species,
say A, depends on its reproductive rate (rA), its current population (nA), and the current
resource units remaining (R - (dAnA + dBnB)). Given such a system of equations, we
can now prove the following theorem:

Theorem 11.1 (SuRF (Super-linear Relative Frequency)). Under the setting described in
Equations (11.3) and (11.4), we have

\[ n_A(t) \propto n_B(t)^{d_B/d_A} \tag{11.5} \]

where nA(t) and nB(t) are the populations of the species at some time t > 0.

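Proof. By dividing Equations (11.3) and (11.4), we obtain

\[ \frac{dn_A}{dn_B} = \frac{r_A n_A}{r_B n_B}
   \quad\Longrightarrow\quad d_A \frac{dn_A}{n_A} = d_B \frac{dn_B}{n_B}, \]

using rA = 1/dA and rB = 1/dB. Integrating both sides from time 0 to t gives

\[ d_A \ln\frac{n_A(t)}{n_A(0)} = d_B \ln\frac{n_B(t)}{n_B(0)}
   \quad\Longrightarrow\quad n_A(t) = c \, n_B(t)^{d_B/d_A},
   \qquad c = \frac{n_A(0)}{n_B(0)^{d_B/d_A}}, \]

where c is some constant and nA(0) and nB(0) are the initial populations of the species
at time t = 0. □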

Thus,  the  longer  a  task  takes  (dA,  dB),  the  exponentially  less  often  it  will  occur. 
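The theorem can be checked numerically. The sketch below integrates Equations (11.3) and (11.4) with a standard fourth-order Runge-Kutta scheme and verifies that dA*log(nA) - dB*log(nB) stays constant along the trajectory, which is equivalent to Equation (11.5). The integrator, step size, parameter values and function names are assumptions chosen for illustration.

```python
import math

def ctm_rhs(nA, nB, dA, dB, R):
    """Right-hand sides of Equations (11.3) and (11.4) with rA = 1/dA, rB = 1/dB."""
    free = R - (dA * nA + dB * nB)          # resource units still available
    return (nA / dA) * free, (nB / dB) * free

def simulate_ctm(nA, nB, dA, dB, R, dt=1e-4, steps=20000):
    """Integrate the two-species system with classical RK4."""
    for _ in range(steps):
        k1 = ctm_rhs(nA, nB, dA, dB, R)
        k2 = ctm_rhs(nA + 0.5 * dt * k1[0], nB + 0.5 * dt * k1[1], dA, dB, R)
        k3 = ctm_rhs(nA + 0.5 * dt * k2[0], nB + 0.5 * dt * k2[1], dA, dB, R)
        k4 = ctm_rhs(nA + dt * k3[0], nB + dt * k3[1], dA, dB, R)
        nA += dt * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]) / 6
        nB += dt * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]) / 6
    return nA, nB

def invariant(nA, nB, dA, dB):
    """Quantity that Theorem 11.1 says is conserved: dA*ln(nA) - dB*ln(nB)."""
    return dA * math.log(nA) - dB * math.log(nB)

# Illustrative values: task B takes twice as long as task A.
dA, dB, R = 1.0, 2.0, 100.0
start = (1.0, 2.0)
end = simulate_ctm(*start, dA, dB, R)
drift = abs(invariant(*end, dA, dB) - invariant(*start, dA, dB))
print("final populations:", end, "invariant drift:", drift)
```

The populations settle where the resource is exhausted (dA*nA + dB*nB = R), and the conserved quantity fixes which point on that line is reached.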


11.3  Experiments 

In  the  following  subsections  we  present  the  experimental  results  that  corroborate  our 
SuRF  theory.  The  real  datasets  we  studied  in  order  to  check  the  validity  of  the  proposed 
model  are  the  following: 

• Tencent Weibo (W)¹: Weibo is one of the largest micro-blogging websites in China.
We studied the behavior of ~2.2 million users; for each one we extracted five
quantities: the number of her tweets, retweets, comments, mentions and followees.

•  Phonecall  dataset  (P):  This  dataset  consists  of  phone-call  records  from  a  mobile 
provider  in  a  big  Asian  city.  For  each  of  the  ~  3.1  million  customers  we  obtained 
the  number  of  her  calls,  messages,  "voice"  and  "sms"  friends,  as  well  as  the  total 
minutes  of  her  phone-calls. 

The  extracted  features  from  each  dataset  correspond  to  the  occurrences  of  the  tasks  (nA, 
nB  etc.)  in  our  SuRF  theory. 

11.3.1  CTM  at  Work 

In  Fig.  11.2  we  present  the  scatter  plots  of  pairs  of  tasks  that  differ  in  the  effort  that  they 
require. 

Observation  11.2.  The  SuRF  pattern  holds  in  the  real-world  datasets;  there  is  a  power-law 
relationship  between  tasks  of  different  difficulty. 

Here is a short explanation of the scatter plots: each user/customer is a blue point on the
plane and is characterized by the number of times she did tasks A and B. Note that all plots
"suffer" from heavy over-plotting, especially for small values of occurrences (not visible in
the plots).

¹ Tencent Weibo Dataset, KDD Cup 2012. http://www.kddcup2012.org

Figure 11.2: SuRF patterns in real datasets (log-log scale): observe the power-law relationship
between tasks of different difficulty. Note that for each dataset we have n = 5 features. We
give the plots for n - 1 pairs, instead of all (n choose 2) pairs. ("W": Tencent Weibo, and
"P": Phonecall data)

Linear regression on this cloud of points fails due to over-plotting, and so we resort to the
following solution: we group the points in logarithmic buckets and compute the mean (red
points) of Y given X. The line, E[Y|X = x], is obtained by linear regression on the red points
(ignoring the last few points, where the observations are extremely sparse, possibly due to the
"horizon effect"). As we observe, in all cases the conditional expectation is a linear function
of x in log-log scale, i.e., a power law.
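The bucketing-and-regression procedure above can be sketched as follows. The number of buckets per decade, the use of the geometric mean as the bucket's x-position, the `drop_last` option and the function name are assumptions of this sketch; the procedure described in the text does not specify them. Inputs are assumed to be strictly positive counts.

```python
import numpy as np

def log_binned_fit(x, y, bins_per_decade=5, drop_last=0):
    """Fit a power law to E[Y | X] using logarithmic buckets.

    Returns (slope, intercept, bucket_x, bucket_mean_y), where slope and
    intercept describe the regression line in log10-log10 scale.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    lo, hi = np.log10(x.min()), np.log10(x.max())
    n_bins = max(1, int(np.ceil((hi - lo) * bins_per_decade)))
    edges = np.logspace(lo, hi, n_bins + 1)
    # Bucket index in 1..n_bins; clip guards the two boundary values.
    idx = np.clip(np.digitize(x, edges), 1, n_bins)
    xs, ys = [], []
    for k in range(1, n_bins + 1):
        mask = idx == k
        if mask.any():
            xs.append(np.exp(np.mean(np.log(x[mask]))))   # geometric mean of X
            ys.append(np.mean(y[mask]))                    # mean of Y in bucket
    xs, ys = np.array(xs), np.array(ys)
    if drop_last:                                          # skip sparse last buckets
        xs, ys = xs[:-drop_last], ys[:-drop_last]
    slope, intercept = np.polyfit(np.log10(xs), np.log10(ys), 1)
    return slope, intercept, xs, ys
```

The returned slope is the power-law exponent; a value above 1 indicates a super-linear relationship between the two task counts.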

Observation 11.3. Deviations from the SuRF patterns, as shown in Fig. 11.2(P2), are due to
outliers, e.g., telemarketers.

The  customers  within  the  cyan  ellipse  all  have  about  100  contacts,  and  100  phone- 
calls,  that  is  about  one  phone-call  per  contact.  The  rest  of  the  population  has  many  more 
phone-calls  than  contacts,  leading  to  the  suspicion  that  the  former  are  telemarketers. 

11.4  Related  Work 

Population Ecology: Population ecology is an area of its own, with numerous books and
milestone papers [MM07]. The competing species analysis is discussed, among others, in [Ste09],
along with the full extension of the famous Lotka-Volterra model for prey-predator
competition. Recall that we analyzed (Chapter 4) a probabilistic competition model for
competing diseases with the 'winner-takes-all' phenomenon [PBRF12].

Power  laws:  Power  laws  have  been  discovered  in  numerous  cases,  often  in  conjunction 
with  fractals  and  self-similarities  [Sch91].  Some  of  the  most  famous  power  laws  are  the 
Zipf  distribution  [Zip49]  and  the  Pareto  distribution  [Par96].  They  have  negative  slopes, 
though.  Power  laws  with  positive  slopes  have  also  been  discovered  (length  of  coastlines, 
number  of  quad-tree  blocks  versus  granularity  [FG96]),  and  more  recently  in  graphs:  the 
number of edges grows super-linearly with the number of nodes [LKF05], with exponents
1.6 or 1.2; the number of participating triangles of a node grows super-linearly with the
degree [Tso08], with exponent ≈ 1.5; the total weight of a node is super-linear in the
degree [MAF08], to name a few.

However,  none  of  the  above  articles  provided  a  solution  to  our  setting,  namely, 
an  explanation  for  the  super-linearity  we  observe,  and  validation  on  several,  diverse 
datasets. 

11.5  Conclusions 

The contributions are the answers to the questions we posed in the introduction: Q1: why do we
see super-linear relationships between counts of tasks; Q2: how can we put this to practical
use. Specifically, the contributions are:




1. A1: Discovery and explanation of power law (CTM) in several, real, diverse
datasets, on most of their n-choose-2 pairs of attributes/tasks.

2.  A2:  Illustration  that  SURF  can  be  used  for  anomaly  detection,  clustering  and 
what-if  scenarios. 

Future  work  may  include  developing  more  expressive  models  to  account  for  the 
variance  in  the  observations. 
Part IV
Conclusion
Chapter 12

Conclusions  and  Future  Directions 


Networks are ubiquitous, providing a simple yet powerful abstraction to tackle several
real-world problems. Moreover, when considering large graphs, epidemics are also everywhere.
For social networks, infectious diseases like the flu are prime examples, but hypes/memes are
similarly epidemic in nature; whether it is friends discussing the latest gadget or phone, or
sharing a funny video, there are nodes 'infecting' each other. Similarly, a computer virus can
cause an epidemic in a computer network, as can a contaminant in a water distribution network.
Applications range from cyber security, epidemiology and public health to product marketing and
information dissemination.

Many  fundamental  questions  underlie  the  propagation-like  processes  in  all  these 
domains,  a  number  of  which  we  addressed  in  this  thesis.  Many  of  these  questions  are  also 
inter-connected,  and  as  this  thesis  demonstrates,  answering  some  of  them  can  be  crucial 
for  others.  We  tackled  multiple  important  questions,  like  understanding  the  tipping 
point behavior of epidemics, predicting who-wins among competing viruses/products,
developing effective algorithms for immunization and marketing for several real-world
settings, and building more realistic online information-diffusion models while analyzing
numerous real datasets.

12.1  Summary  of  contributions 

We  summarize  the  major  contributions  and  impact  of  this  thesis  next.  Answering 
problems  spanning  multiple  domains  necessitates  a  multi-pronged  approach.  Hence,  the 
contributions  span  three  inter-dependent  areas: 

I (Theory) Discovering the importance of eigenvalues and winner-takes-all phenomena
(Chapters 2, 3, 4, 5)

II (Algorithms) Orders of magnitude faster and substantially more effective algorithms for
immunization and culprits detection (Chapters 6, 7, 8, 9)

III (Models) More expressive, unifying, interpretable and predictive models using numerous
real-world datasets (Chapters 10, 11)
We are arguably the first to present a systematic study of propagation and immunization
of single as well as multiple viruses on arbitrary, real and time-varying networks, as
the vast majority of the literature focuses on structured topologies, cliques, and related
unrealistic models.

Theory 

•  Eigenvalues  for  Epidemic  Threshold:  We  showed  that  the  epidemic  threshold  for 
single  viruses  depends  on  the  largest  eigenvalue  of  an  'appropriate'  matrix  for 
arbitrary  static  and  time-varying  graphs.  Our  eigenvalue  result  generalizes  and 
unifies  previous  results  and  has  broad  implications  and  applications  like  faster 
epidemiological  simulations.  We  are  the  first  to  show  the  threshold  on  arbitrary 
static  graphs  and  almost  any  virus  propagation  model.  In  addition,  ours  is  the  first 
closed  formula  for  any  set  of  arbitrary  time-varying  graphs. 

• Winner-Takes-All for Competing Viruses: We are the first to show the surprising
'winner-takes-all' phenomenon involving competing viruses/products spreading
on arbitrary graphs. Additionally, we are the first to extend this problem to
mutually-interacting viruses and show a phase-transition.

• Impact: Our paper on epidemic thresholds on static graphs [PCF+11] was selected
as one of the best papers of the conference. Our results have been used to enable
other  important  tasks  like  anomaly  detection  and  graph  modeling  [AMF10],  and 
immunization  (see  Part  II  of  this  thesis).  They  are  also  being  incorporated  into 
FRED,  an  epidemiological  simulator  developed  by  MIDAS. 
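As a small illustration of the eigenvalue threshold in the first bullet above, the sketch below computes the largest eigenvalue of a graph and the resulting effective strength. It assumes a static undirected graph, takes the adjacency matrix as the 'appropriate' matrix, and uses the SIS-style ratio beta/delta (infection rate over recovery rate) as the model constant; the function name and the example values are hypothetical.

```python
import numpy as np

def effective_strength(adj, beta, delta):
    """Return lambda_1 * beta / delta for a symmetric adjacency matrix.

    Under the assumptions stated above, a value below 1 means the infection
    is expected to die out, and a value above 1 means an epidemic can occur.
    """
    # eigvalsh returns eigenvalues of a symmetric matrix in ascending order.
    lam1 = np.linalg.eigvalsh(np.asarray(adj, dtype=float))[-1]
    return lam1 * beta / delta

# Star graph: hub 0 connected to 4 leaves; its largest eigenvalue is sqrt(4) = 2.
A = np.zeros((5, 5))
A[0, 1:] = A[1:, 0] = 1.0
s = effective_strength(A, beta=0.1, delta=0.4)
print("below threshold" if s < 1 else "above threshold")
```

For large sparse graphs one would typically use an iterative sparse eigensolver instead of a dense decomposition; the dense call here just keeps the example short.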

Algorithms 

• Dramatically better Immunization: Our carefully designed algorithms such as
NETSHIELD and SMART-ALLOC substantially improved the state-of-the-art in solving
the complete and fractional immunization problems respectively. NETSHIELD
outperformed many methods by more than 7 orders of magnitude in running time,
and competitors like the well-known acquaintance immunization in quality of
solutions. Using task-specific structure, SMART-ALLOC achieves up to 6x fewer
infections and 30,000x speed-up over current practice and ad-hoc heuristics on
real hospital patient-transfer networks like US-MEDICARE and PENN-ALL (the
current practice has been largely focused within individual hospitals).

• Parameter-free Culprits detection: Our algorithm NETSLEUTH is the first linear-time
algorithm (in edges and nodes) for automatically identifying the set of culprits
which best describes a given snapshot of the epidemic (both in identity and
number), in contrast to state-of-the-art methods (which are at least quadratic in
running time, and require the number of seeds as input).

• Impact: Our results and algorithms have been incorporated into undergraduate
courses (UPitt Summer Program), our slides have been requested for graduate courses
(Xifeng Yan, UCSB), and the work has appeared in ACM Crossroads.


Models 

•  Unifying models for online diffusion: We developed SpikeM, a powerful, succinct
model to explain the rise and fall patterns of information diffusion which
unifies and includes earlier patterns and models, matches behavior of numerous
real datasets and can be used to forecast, answer 'what-if' scenarios and even
reverse-engineer epidemics.

•  Explaining power-laws in competing tasks: We developed CTM, an intuitive
model to explain the prevalence of super-linear relationships between the frequencies
of various competing tasks observed in real datasets, and use it to spot outliers
like telemarketers.


12.2  Vision  and  Future  directions 

Looking  ahead,  the  long-term  goal  is  to  solve  large  real-world  problems  by  building 
models  and  understanding  the  dynamics  of  networked  systems,  be  it  social,  technological, 
or  natural.  To  solve  high-impact  problems  involving  complex  systems,  our  vision  is  a 
multi-pronged  'end-to-end'  research  pipeline:  (a)  defining  and  collecting  rich  network 
structure  (collaborating  with  domain  experts),  (b)  understanding  network  properties 
and then abstracting/analyzing real-life processes on these networks, and (c) developing
and  implementing  efficient  algorithms  to  manipulate  such  processes  for  our  benefit. 

This  thesis  has  made  several  major  steps  in  these  broad  directions — we  now  better 
understand  the  effect  of  graph  topology  on  various  propagation  processes  (like  epidemic 
thresholds),  have  better  algorithms  for  many  immunization  and  marketing  tasks  (like 
SMART-ALLOC),  and  have  better  models  for  describing  propagation  scenarios  (like 
SpikeM).  Several  challenges  remain,  some  of  which  we  describe  next. 

12.2.1  Long  Term  Challenges 

Scalability: One major thrust area will be a continued focus on big-data
and computing systems. We have already used data-intensive map-reduce type
architectures (Hadoop) and distributed compute-intensive systems (Condor) for our
current research. Such systems would not only allow us to perform large-scale
analysis and develop deployable solutions for real problems, but may also serve as
a source of interesting problems, e.g., how do failures cascade in Hadoop clusters?

Richer settings & dynamic parameters: What happens when the meme/virus evolves
over time, e.g., a flu-like virus becomes a mumps-like virus? What changes if this
happens in an adversarial manner, in a game-theoretic fashion? In this thesis, we
studied what happens when the underlying network changes with time. We wish to
further study changes in other parameters, like the evolution of viruses and memes
and the effect of the sudden introduction of other viruses/topics/products into the
ecosystem. Similarly, how do node attributes, like gender and age, affect the
distribution of vaccines? How important are weak social ties (as opposed to strong
friendship) in dynamical phenomena? How does additional uncertainty in edges
change our answer to the culprits problem? Most existing models and algorithms
work on plain graphs, where all nodes and edges are the same. Incorporating richer
attributed network data, with auxiliary and uncertain features like historical
attributes, textual information and geographical information, can greatly enhance
models as well as algorithms.

Finally, we believe our inter-disciplinary approach is also vital here and has indeed led
to many discoveries in this thesis, spanning areas like public health, social media and
networking. We also need to use tools from Data Mining, Machine Learning, Statistical
Physics and Biology to build and analyze the models and mechanisms. All of this
has to be done with the understanding that we are trying to both understand and
manage real-life processes which operate on a societal scale. These are exciting
times for research in networks.